Performance of Next-Generation AI Chatbots in Gynecological Knowledge Assessment: A Comparative Pilot Study of ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus

doi:10.21203/rs.3.rs-8264890/v1

2025 · doi:10.21203/rs.3.rs-8264890/v1

preprint OA: closed

Research Article

Huan Out, Zhen Wang

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8264890/v1

This work is licensed under a CC BY 4.0 License.

Status: Published. Journal publication published 03 Mar, 2026 in Archives of Gynecology and Obstetrics.

Abstract

Purpose

As artificial intelligence (AI) models evolve into their next generations, their application in specialized medical fields requires rigorous validation. While large language models (LLMs) have shown promise in general medicine, their reliability in complex gynecological clinical reasoning remains under-explored.
This pilot study aimed to comparatively assess the knowledge retention, safety, and reasoning limitations of advanced AI chatbots in gynecology using a constrained zero-shot multiple-choice question (MCQ) format.

Methods

A total of 70 text-based MCQs covering seven core gynecological modules were adapted from USMLE Step 2 CK standards. The questions were administered to four advanced AI models: ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus. To simulate a rapid-retrieval clinical scenario, models were tested under "zero-shot" conditions with a constrained prompt prohibiting reasoning steps. We performed both quantitative statistical analysis (Kruskal–Wallis, Cochran’s Q) and qualitative error analysis to identify specific failure modes.

Results

Contrary to expectations for advanced models, overall accuracy was unsatisfactory: Gemini-3 (32.86%), DeepSeek-V3.2 (30.00%), ChatGPT-5 (25.71%), and Claude-4.5-Opus (21.43%). Significant performance disparities were observed across modules. Notably, ChatGPT-5 scored 0.00% in Infertility, while DeepSeek-V3.2 reached 70.00% in Common Benign Conditions. Qualitative analysis revealed three critical failure patterns: (1) Semantic Association Bias (confusing high-probability diseases with symptom-specific diagnoses), (2) Spatial Anatomy Confusion, and (3) Genetic Logic Reversal. No significant correlation was found between item difficulty and accuracy (p > 0.05).

Conclusion

Under constrained non-reasoning prompts, even next-generation AI chatbots demonstrate unsatisfactory performance in gynecology. The qualitative analysis suggests that models often rely on probabilistic keyword matching rather than physiological simulation, leading to dangerous clinical errors (e.g., misidentifying the deficient adrenal enzyme). While potential exists, current reliability is insufficient for unsupervised use in gynecological education. These findings highlight the critical need for "Chain-of-Thought" prompting and human expert oversight.
Keywords: Gynecology, Large Language Models, Medical Education, Clinical Reasoning, Hallucination, Pilot Study

Figure 1. Overall accuracy comparison of four AI chatbots in gynecological multiple-choice questions.

Introduction

The integration of artificial intelligence (AI) into healthcare has transitioned from a theoretical possibility to a practical reality, fundamentally shifting the paradigm of medical education and clinical decision support [1, 2]. Large Language Models (LLMs), trained on vast datasets of medical literature and clinical guidelines, offer the promise of democratizing access to specialized knowledge. In recent years, the rapid evolution from early models to next-generation iterations, including ChatGPT-5, Gemini-3, and Claude-4.5-Opus, has led to a surge in expectations regarding their reasoning capabilities. Medical students increasingly rely on these tools as personalized tutors, while junior clinicians utilize them for rapid information retrieval at the point of care [3, 4].

However, the application of AI in specialized medical domains such as gynecology presents unique and often underestimated challenges. Unlike basic anatomical sciences, which rely heavily on static factual recall, clinical gynecology is a dynamic field that sits at the intersection of complex pelvic anatomy, fluctuating endocrine physiology, surgical principles, and oncology [5]. Furthermore, gynecological management is heavily dependent on evolving clinical guidelines (e.g., ACOG, ASCCP, ESHRE) that are frequently updated, creating a "knowledge cutoff" risk for pre-trained models. A student asking about the management of a 16-year-old with primary amenorrhea requires an answer that integrates genetic karyotyping, anatomical evaluation, and hormonal profiling, not merely a generic textbook definition.

Despite the widespread adoption of these tools, rigorous academic validation has lagged behind commercial deployment.
While previous studies have evaluated older model versions in general internal medicine examinations [6, 7], there is a paucity of data concerning the performance of the newest model iterations specifically in gynecological sub-specialties. Moreover, most existing evaluations have allowed models to utilize "Chain-of-Thought" (CoT) reasoning (encouraging the AI to explain its steps), which has been shown to improve accuracy. However, this does not reflect the real-world behavior of many users who seek immediate, binary answers under time pressure.

A critical safety barrier in medical AI is the phenomenon of "hallucination": the confident generation of plausible but factually incorrect information [8]. In gynecology, where misinformation can lead to inappropriate surgical interventions (e.g., confusion between uterine artery and ureter) or missed endocrine diagnoses (e.g., CAH variants), the tolerance for error is nonexistent. To truly understand the reliability of the internal knowledge bases of these next-generation models, it is essential to evaluate them under "stress-test" conditions.

Therefore, this pilot study aims to benchmark the gynecological knowledge of four widely used next-generation AI chatbots: ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus. Unlike previous studies that focused on "prompt engineering" to maximize AI scores, this study adopts a "Zero-Shot Constrained" approach combined with deep qualitative case analysis. We hypothesize that despite their advanced architecture, these models may still exhibit significant gaps in clinical logic when forced to rely on direct retrieval without reasoning support.

Materials and Methods

Study Design

This comparative cross-sectional pilot study was conducted in June 2025. The study design focuses on the quantitative and qualitative evaluation of non-human AI agents.
In accordance with institutional review board (IRB) guidelines, ethical approval was waived as the research did not involve human subjects or patient data.

AI Models Evaluated

To ensure a comprehensive representation of the current AI landscape, four state-of-the-art large language models (LLMs) were selected for evaluation:

- ChatGPT-5 (OpenAI, San Francisco, CA, USA): Representing the latest iteration of the Generative Pre-trained Transformer series.
- Gemini-3 (Google LLC, Mountain View, CA, USA): A multimodal model designed with enhanced reasoning capabilities.
- DeepSeek-V3.2 (DeepSeek Inc., China): An emerging model noted for its efficiency in academic contexts.
- Claude-4.5-Opus (Anthropic, San Francisco, CA, USA): A model noted for its emphasis on safety alignment and its large context window.

All models were accessed via their official web-based interfaces. To prevent "learning bias," a new chat session was initiated for each specific gynecological module.

Question Database Curation

A dataset of 70 multiple-choice questions (MCQs) was curated to represent the core curriculum of clinical gynecology. The questions were adapted from sample items of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Knowledge (CK). The items were evenly distributed (n=10 per module) across seven distinct domains:

1. Female Reproductive Anatomy & Development
2. Menstrual Physiology & Endocrine Regulation
3. Common Benign Gynecologic Conditions
4. Gynecologic Cancer Screening & Management
5. Infertility & Assisted Reproduction Basics
6. Contraception & Family Planning
7. Menopause & Hormone Replacement Therapy

The selected items possessed an estimated item difficulty index (Pj) ranging from 0.1 to 0.6. Questions requiring visual interpretation (e.g., histology slides) were excluded to focus solely on text-based clinical reasoning.
Prompt Engineering Strategy: The "Zero-Shot" Constraint

To simulate a "rapid-query" environment and test the robustness of the models' underlying weights without the aid of self-correction, we used a Zero-Shot Constrained Prompt:

"You are a medical exam assistant. Read the following gynecology question. Do not explain your reasoning. Do not provide a rationale. Output ONLY the correct option letter (e.g., A, B, C)."

Data Analysis

Quantitative Analysis: Accuracy was defined as the percentage of correct responses. Non-parametric tests (Kruskal–Wallis, Cochran’s Q) were used because the data did not meet parametric assumptions. Point-Biserial Correlation assessed the relationship between item difficulty (Pj) and AI success.

Qualitative Analysis: A content analysis was performed on incorrect responses to identify recurring cognitive errors. Errors were categorized into: (1) Fact Retrieval Errors, (2) Logic/Reasoning Errors, and (3) Hallucinations. Specific case studies were extracted to illustrate these failure modes.

All statistical analyses were performed using IBM SPSS Statistics version 26.0.

Results

Overall Performance Landscape

All four AI models successfully generated responses for all 70 questions. However, the quantitative results revealed a surprisingly low performance ceiling for these "next-generation" tools under the constrained experimental conditions. Gemini-3 achieved the highest nominal accuracy (23/70, 32.86%), followed by DeepSeek-V3.2 (21/70, 30.00%) and ChatGPT-5 (18/70, 25.71%). Claude-4.5-Opus demonstrated the lowest accuracy (15/70, 21.43%). The Kruskal–Wallis test indicated no statistically significant difference in the overall score distribution among the four models (H(3) = 2.45, p = 0.48), suggesting that despite architectural differences, all models share similar limitations in processing complex gynecological vignettes without reasoning steps.
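The accuracy figures above follow directly from the reported correct-answer counts. As a minimal sketch (not the authors' scoring pipeline, which the paper does not describe), the Python below shows how the constrained prompt could be attached to a question, how a letter-only reply could be parsed, and how the percentages are obtained. The prompt string is quoted from the Methods; `build_prompt`, `parse_option`, `accuracy_percent`, the assumed A-E option range, and the `REPORTED_CORRECT` layout are hypothetical names introduced for illustration. The Claude-4.5-Opus count of 15 is taken from Table 1.

```python
import re
from typing import Optional

# Constrained prompt, quoted from the Methods section of the paper.
ZERO_SHOT_PROMPT = (
    "You are a medical exam assistant. Read the following gynecology question. "
    "Do not explain your reasoning. Do not provide a rationale. "
    "Output ONLY the correct option letter (e.g., A, B, C)."
)


def build_prompt(question_text: str) -> str:
    """Prepend the constrained instruction to one MCQ (hypothetical helper)."""
    return f"{ZERO_SHOT_PROMPT}\n\n{question_text}"


def parse_option(response: str) -> Optional[str]:
    """Return the first standalone capital letter A-E in a reply, else None.

    Assumption: options are labelled A-E; the paper does not state the range.
    """
    match = re.search(r"\b([A-E])\b", response)
    return match.group(1) if match else None


def accuracy_percent(correct: int, total: int) -> float:
    """Accuracy as defined in the paper: percentage of correct responses."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 2)


# Correct-answer counts out of 70, as reported in the Results and Table 1.
REPORTED_CORRECT = {
    "ChatGPT-5": 18,
    "Gemini-3": 23,
    "DeepSeek-V3.2": 21,
    "Claude-4.5-Opus": 15,
}

if __name__ == "__main__":
    for model, correct in REPORTED_CORRECT.items():
        print(f"{model}: {correct}/70 = {accuracy_percent(correct, 70):.2f}%")
```

Parsing only a standalone capital letter is a deliberately strict choice: a reply that ignores the instruction and contains no option letter is scored as unanswered rather than guessed at.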
Module-Specific Performance Variations

Granular analysis revealed drastic inconsistencies (Table 1).

- Outlier Performance: In Common Benign Gynecologic Conditions, DeepSeek-V3.2 achieved 70.00% accuracy, significantly outperforming Claude-4.5-Opus (0.00%, p < 0.01).
- The "Infertility" Blind Spot: Near-universal failure was observed in Infertility & Assisted Reproduction, where no model exceeded 20.00% and ChatGPT-5 scored 0.00%. Questions regarding ART protocols and hormonal ratios were consistently missed.
- Inconsistency: Gemini-3 led in Gynecologic Cancer Screening & Management (50.00%) but scored only 10.00% in Menopause & Hormone Replacement Therapy.

Table 1 Comparison of Answer Accuracy Rates of AI Chatbots Across Gynecological Modules

Gynecological Modules | Total Questions | ChatGPT-5 | Gemini-3 | DeepSeek-V3.2 | Claude-4.5-Opus
Female Reproductive Anatomy & Development | 10 | 2/10 (20.00%) | 3/10 (30.00%) | 4/10 (40.00%) | 4/10 (40.00%)
Menstrual Physiology & Endocrine Regulation | 10 | 3/10 (30.00%) | 5/10 (50.00%) | 4/10 (40.00%) | 2/10 (20.00%)
Common Benign Gynecologic Conditions | 10 | 3/10 (30.00%) | 5/10 (50.00%) | 7/10 (70.00%) | 0/10 (0.00%)
Gynecologic Cancer Screening & Management | 10 | 2/10 (20.00%) | 5/10 (50.00%) | 1/10 (10.00%) | 3/10 (30.00%)
Infertility & Assisted Reproduction Basics | 10 | 0/10 (0.00%) | 1/10 (10.00%) | 1/10 (10.00%) | 2/10 (20.00%)
Contraception & Family Planning | 10 | 5/10 (50.00%) | 3/10 (30.00%) | 1/10 (10.00%) | 2/10 (20.00%)
Menopause & Hormone Replacement Therapy | 10 | 3/10 (30.00%) | 1/10 (10.00%) | 3/10 (30.00%) | 2/10 (20.00%)
Overall Statistics | 70 | 18/70 (25.71%) | 23/70 (32.86%) | 21/70 (30.00%) | 15/70 (21.43%)

Relationship with Item Difficulty

No significant correlation was found between item difficulty (Pj) and AI accuracy for any model (e.g., ChatGPT-5: r = 0.023, p = 0.85). This indicates that AI chatbots miss "easy" questions about as often as "hard" ones (Table 2).
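The point-biserial coefficient used here is the Pearson correlation between a binary variable (item answered correctly or not) and a continuous one (Pj). It can be written as r = (M1 - M0) / s * sqrt(p(1 - p)), where M1 and M0 are the mean Pj of correctly and incorrectly answered items, s is the population standard deviation of Pj, and p is the proportion correct. A minimal sketch of that computation follows. The function name and the toy data are hypothetical; they are not the study's item-level responses, which the authors state they cannot share, and the study itself used SPSS rather than this code.

```python
from math import sqrt
from typing import Sequence


def point_biserial(correct: Sequence[int], difficulty: Sequence[float]) -> float:
    """Point-biserial correlation between 0/1 outcomes and a continuous score.

    `correct` holds 1 for a correct answer and 0 otherwise;
    `difficulty` holds the item difficulty index (Pj) for the same items.
    """
    n = len(correct)
    if n != len(difficulty) or n < 2:
        raise ValueError("inputs must have equal length of at least 2")

    ones = [d for c, d in zip(correct, difficulty) if c == 1]
    zeros = [d for c, d in zip(correct, difficulty) if c == 0]
    if not ones or not zeros:
        raise ValueError("both outcome groups must be non-empty")

    mean_all = sum(difficulty) / n
    # Population standard deviation (divide by n), matching the formula above.
    sd = sqrt(sum((d - mean_all) ** 2 for d in difficulty) / n)
    if sd == 0:
        raise ValueError("difficulty values must not all be identical")

    p = len(ones) / n
    mean_ones = sum(ones) / len(ones)
    mean_zeros = sum(zeros) / len(zeros)
    return (mean_ones - mean_zeros) / sd * sqrt(p * (1 - p))


if __name__ == "__main__":
    # Toy example only: four items, two answered correctly.
    toy_correct = [1, 1, 0, 0]
    toy_pj = [0.6, 0.4, 0.3, 0.1]
    print(round(point_biserial(toy_correct, toy_pj), 3))
```

A coefficient near zero, as in Table 2, means the mean Pj of the items a model answered correctly is about the same as the mean Pj of the items it missed.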
Table 2 Evaluation of the Impact of Item Difficulty Index on AI Chatbot Responses

Indicators | Sample Size (N) | Mean | Standard Deviation (SD) | Correlation Coefficient with Difficulty Index (Pj) | P-value
Item Difficulty Index (Pj) | 70 | 0.27 | 0.198 | - | -
ChatGPT-5 | 70 | 0.26 | 0.441 | 0.023 | 0.85
Gemini-3 | 70 | 0.33 | 0.472 | -0.018 | 0.88
DeepSeek-V3.2 | 70 | 0.30 | 0.461 | -0.056 | 0.65
Claude-4.5-Opus | 70 | 0.21 | 0.411 | 0.037 | 0.76

Qualitative Analysis of Error Patterns (Case Studies)

To understand why the models failed, we conducted a qualitative content analysis of the incorrect responses based on the clinical vignettes provided. This revealed distinct failure modes where the AI prioritized probabilistic word association over clinical logic. Table 3 highlights three representative cases where models demonstrated high confidence but incorrect reasoning.

Table 3 Qualitative Case Study of AI Errors in Gynecological Clinical Reasoning

Case A (Module: Endocrine)
Clinical vignette summary: Patient with primary amenorrhea, hypertension, and hypokalemia. 46,XX karyotype.
Correct diagnosis/mechanism: 17-alpha-hydroxylase deficiency (accumulation of mineralocorticoids causes HTN).
Common AI error: 21-hydroxylase deficiency.
Analysis: Semantic Association Bias. Models ignored the "hypertension" sign and selected the most statistically common form of CAH (21-OH), which actually causes hypotension.

Case B (Module: Anatomy)
Clinical vignette summary: Identification of the structure most at risk of injury during hysterectomy near the uterine artery.
Correct diagnosis/mechanism: Ureter ("Water under the bridge").
Common AI error: Internal Iliac Artery or Cardinal Ligament.
Analysis: Spatial Blindness. Models failed to visualize the 3D anatomical relationship where the ureter passes inferior to the uterine artery.

Case C (Module: Development)
Clinical vignette summary: 16-year-old, primary amenorrhea, absent uterus, 46,XY karyotype.
Correct diagnosis/mechanism: Androgen Insensitivity Syndrome (AIS).
Common AI error: Mullerian Agenesis (MRKH).
Analysis: Hierarchical Logic Failure. Models correctly identified the anatomical defect (absent uterus) but failed to prioritize the genetic evidence (XY), confusing it with the XX-based MRKH syndrome.

Discussion

The "Intelligence Gap" in Specialized Medicine

This pilot study provides a sobering reality check for the enthusiasm surrounding next-generation AI in healthcare. Despite the marketing regarding the advanced capabilities of ChatGPT-5 and Gemini-3, their performance in this controlled gynecological assessment (21–33% accuracy) was surprisingly poor. To put this in perspective, the passing threshold for USMLE examinations is typically around 60%. By this metric, none of the evaluated models would be considered "safe" or "competent" to practice gynecology under zero-shot conditions. This finding contrasts with some previous reports of high GPT-4 performance, highlighting that specialized domains require deeper validation than general medicine benchmarks.

Analyzing the Failure Modes: Why Did They Fail?

Our qualitative analysis (Table 3) provides critical insights into the cognitive architecture of these failures.

1. Semantic Association Bias (The "Probabilistic Trap")

In Case A (Endocrine), the AI models exhibited what we term "Semantic Association Bias." When presented with a vignette about "Congenital Adrenal Hyperplasia" (CAH), the models gravitated towards "21-hydroxylase deficiency" because it is the most common subtype mentioned in medical literature. However, the key clinical discriminator, hypertension, was ignored. In a clinical setting, this is dangerous. Treating a hypertensive newborn for a salt-wasting condition (21-OH deficiency) based on an AI suggestion could be fatal. This suggests that current LLMs operate as "Pattern Matchers" rather than "Physiological Simulators."

2. Spatial Blindness in Text-Based Models

The failure in Case B (Anatomy) highlights a limitation of text-only training.
The relationship between the ureter and the uterine artery is a visual, 3D concept ("water under the bridge"). While the AI likely "read" this phrase in its training data, it failed to apply the spatial logic to a surgical scenario involving ligament clamping. This suggests that for surgical education, text-based chatbots must be supplemented with multimodal inputs (images/3D models) to be reliable.

3. Hierarchical Logic Failure

In Case C (AIS vs. MRKH), the models failed to apply the correct diagnostic hierarchy. A human clinician knows that karyotype (XY vs XX) is the definitive discriminator for uterine agenesis. The AI, however, seemed to weight the symptom description "absent uterus" more heavily than the genetic data, leading to a misdiagnosis.

The Critical Role of Prompt Engineering

The low scores were likely exacerbated by the Zero-Shot Constrained Prompt ("Do not explain..."). Previous research [1] has shown that "Chain-of-Thought" (CoT) prompting can significantly improve performance. However, our negative result is scientifically valuable because it highlights a dangerous user behavior. In real-world settings, stressed medical students or busy clinicians often ask quick questions expecting quick answers. Our data indicate that without the "crutch" of self-explanation, the AI's raw answer retrieval is unreliable. This may imply that the AI does not truly "know" the medical concept; it reconstructs it. When the reconstruction process (reasoning) is blocked, the probability model collapses.

Educational Implications: The "Stochastic Parrot" Risk

A particularly concerning finding is the lack of correlation between item difficulty and AI performance (Table 2). In educational psychology, a competent learner shows a predictable pattern: getting easy questions right and struggling with hard ones. The AI models, however, exhibited "stochastic" behavior [11]. They might correctly identify a rare tumor marker but fail a basic contraceptive question.
This unpredictability makes them dangerous educational tools for novices, who may gain false confidence from the AI's correct answers to complex questions, not realizing it has hallucinated on basic principles.

Limitations

This study has limitations. First, the sample size (N = 70) is relatively small, though sufficient for a pilot study to identify qualitative patterns. Second, the text-only format excludes visual diagnostic skills. Third, the "Zero-Shot" constraint represents a "stress test" scenario. Finally, the models are closed-source, preventing direct inspection of their training data regarding specific pathologies like infertility.

Conclusion

This comparative pilot study shows that under constrained zero-shot conditions, next-generation AI chatbots (ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus) perform unsatisfactorily in gynecological clinical reasoning, with accuracy rates falling well below human passing standards.

Key Takeaways:

- Unreliability: Current models cannot be trusted for unsupervised information retrieval in gynecology, particularly for differentiating complex endocrine disorders (e.g., CAH subtypes).
- Mechanism of Failure: Errors are often driven by "probabilistic bias," where the AI selects the most common disease association rather than analyzing specific clinical signs (e.g., hypertension).
- Safety Risk: The lack of correlation between difficulty and accuracy means AI failure is unpredictable.

We recommend that medical educators treat these tools as "preliminary search engines" rather than "digital professors." Future implementation in curricula must mandate "Chain-of-Thought" prompting strategies to mitigate the risks of semantic bias and hallucination.

Declarations

Data availability: The authors do not have permission to share data.
Competing interest: None.
Funding: None.
Consent for publication: Not applicable.
Data sharing statement: Not applicable.

References

1. Nori H, King N, McKinney SM, et al. Capabilities of GPT-4 on Medical Challenge Problems. N Engl J Med AI. 2024;1(1):AIoa2300038.
2. Thirunavukarasu AJ, Ting DSJ, Elangovan K, et al. Large language models in medicine. Nat Med. 2023;29:1930–1940.
3. Kung TH, Cheatham M, Medenilla A, et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198.
4. Gilson A, Safranek CW, Huang T, et al. How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ. 2023;9:e45312.
5. Singhal K, Azizi S, Tu T, et al. Large language models encode clinical knowledge. Nature. 2023;620:172–180.
6. Strong E, DiGiammarino A, Weng Y, et al. Chatbot vs Medical Student Performance on Free-Response Clinical Reasoning Examinations. JAMA Intern Med. 2023;183(9):1028–1030.
7. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. Lancet Digit Health. 2023;5(12):e858–e860.
8. Li H, Moon JT, Purkayastha S, et al. Ethics of large language models in medicine and medical education. Lancet Digit Health. 2023;5(6):e333–e335.
9. Dave T, Athaluri SA, Singh S. ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. Front Artif Intell. 2023;6:1169595.
10. Lee P, Bubeck S, Petro J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med. 2023;388:1233–1235.
11. Bender EM, Gebru T, McMillan-Major A, Shmitchell S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? FAccT '21. 2021:610–623.
12. Haupt CE, Marks M. AI-Generated Medical Advice—GPT and Beyond. JAMA. 2023;329(16):1349–1350.
13. Wong RS, Ming LC, Dang CP, et al. Medical student perceptions of the application of artificial intelligence in medical education: a systematic review. J Med Educ Curric Dev. 2024;11:23821205241226342.
14. Clusmann J, Kolb-Lagen B, Drutschke D, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023;3:141.
15. De Angelis L, Baglivo F, Arzilli G, et al. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in academia. BMC Res Notes. 2023;16:294.

Additional Declarations

No competing interests reported.

Journal editorial timeline:

- Editorial decision: Revision requested, 07 Jan, 2026
- Reviews received at journal, 05 Jan, 2026
- Reviews received at journal, 04 Jan, 2026
- Reviewers agreed at journal, 23 Dec, 2025
- Reviewers agreed at journal, 23 Dec, 2025
- Reviewers agreed at journal, 21 Dec, 2025
- Reviewers invited by journal, 10 Dec, 2025
- Editor assigned by journal, 09 Dec, 2025
- Submission checks completed at journal, 08 Dec, 2025
- First submitted to journal, 02 Dec, 2025
To truly understand the reliability of the internal knowledge bases of these next-generation models, it is essential to evaluate them under \"stress-test\" conditions.\u003c/p\u003e\n\u003cp\u003eTherefore, this pilot study aims to benchmark the gynecological knowledge of four widely used next-generation AI chatbots: ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus. Unlike previous studies that focused on \"prompt engineering\" to maximize AI scores, this study adopts a \"Zero-Shot Constrained\" approach combined with deep qualitative case analysis. We hypothesize that despite their advanced architecture, these models may still exhibit significant gaps in clinical logic when forced to rely on direct retrieval without reasoning support.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cp\u003e\u003cstrong\u003eStudy Design\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis comparative cross-sectional pilot study was conducted in June 2025. The study design focuses on the quantitative and qualitative evaluation of non-human AI agents. 
In accordance with institutional review board (IRB) guidelines, ethical approval was waived as the research did not involve human subjects or patient data.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAI Models Evaluated\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo ensure a comprehensive representation of the current AI landscape, four state-of-the-art large language models (LLMs) were selected for evaluation:\u003c/p\u003e\n\u003col start=\"1\" type=\"1\"\u003e\n \u003cli\u003e\u003cstrong\u003eChatGPT-5\u003c/strong\u003e (OpenAI, San Francisco, CA, USA): Representing the latest iteration of the Generative Pre-trained Transformer series.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eGemini-3\u003c/strong\u003e (Google LLC, Mountain View, CA, USA): A multimodal model designed with enhanced reasoning capabilities.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eDeepSeek-V3.2\u003c/strong\u003e (DeepSeek Inc., China): An emerging model noted for its efficiency in academic contexts.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eClaude-4.5-Opus\u003c/strong\u003e (Anthropic, San Francisco, CA, USA): A model emphasized for its safety alignment and large context window.\u003cbr\u003e\u0026nbsp;All models were accessed via their official web-based interfaces. To prevent \"learning bias,\" a new chat session was initiated for each specific gynecological module.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003e\u003cstrong\u003eQuestion Database Curation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eA dataset of 70 multiple-choice questions (MCQs) was curated to represent the core curriculum of clinical gynecology. The questions were adapted from sample items of the United States Medical Licensing Examination (USMLE) Step 2 Clinical Knowledge (CK). 
The items were evenly distributed (n=10 per module) across seven distinct domains:\u003c/p\u003e\n\u003cul type=\"disc\"\u003e\n \u003cli\u003e\u003cem\u003eFemale Reproductive Anatomy \u0026amp; Development\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003e\u003cem\u003eMenstrual Physiology \u0026amp; Endocrine Regulation\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003e\u003cem\u003eCommon Benign Gynecologic Conditions\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003e\u003cem\u003eGynecologic Cancer Screening \u0026amp; Management\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003e\u003cem\u003eInfertility \u0026amp; Assisted Reproduction Basics\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003e\u003cem\u003eContraception \u0026amp; Family Planning\u003c/em\u003e\u003c/li\u003e\n \u003cli\u003e\u003cem\u003eMenopause \u0026amp; Hormone Replacement Therapy\u003c/em\u003e\u003c/li\u003e\n\u003c/ul\u003e\n\u003cp\u003eThe selected items possessed an estimated item difficulty index (Pj) ranging from 0.1 to 0.6. Questions requiring visual interpretation (e.g., histology slides) were excluded to focus solely on text-based clinical reasoning.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePrompt Engineering Strategy: The \"Zero-Shot\" Constraint\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eTo simulate a \"rapid-query\" environment and test the robustness of the models' underlying weights without the aid of self-correction, we used a\u0026nbsp;\u003cstrong\u003eZero-Shot Constrained Prompt\u003c/strong\u003e:\u003cbr\u003e\u003cem\u003e\"You are a medical exam assistant. Read the following gynecology question. Do not explain your reasoning. Do not provide a rationale. Output ONLY the correct option letter (e.g., A, B, C).\"\u003c/em\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eQuantitative Analysis:\u003c/strong\u003e Accuracy was defined as the percentage of correct responses. 
Non-parametric tests (Kruskal–Wallis, Cochran’s Q) were used due to the sample distribution. Point-Biserial Correlation assessed the relationship between item difficulty (Pj) and AI success.\u003cbr\u003e\u003cstrong\u003eQualitative Analysis:\u003c/strong\u003e A content analysis was performed on incorrect responses to identify recurring cognitive errors. Errors were categorized into: (1) Fact Retrieval Errors, (2) Logic/Reasoning Errors, and (3) Hallucinations. Specific case studies were extracted to illustrate these failure modes.\u003cbr\u003e\u0026nbsp;All statistical analyses were performed using IBM SPSS Statistics version 26.0.\u003c/p\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eOverall Performance Landscape\u003c/h2\u003e \u003cp\u003eAll four AI models successfully generated responses for all 70 questions. However, the quantitative results revealed a surprisingly low performance ceiling for these \"next-generation\" tools under the constrained experimental conditions.\u003c/p\u003e \u003cp\u003eGemini-3 achieved the highest nominal accuracy (23/70, 32.86%), followed by DeepSeek-V3.2 (21/70, 30.00%) and ChatGPT-5 (18/70, 25.71%). 
Claude-4.5-Opus demonstrated the lowest accuracy (15/70, 21.43%).\u003c/p\u003e \u003cp\u003eThe Kruskal\u0026ndash;Wallis test indicated no statistically significant difference in the overall score distribution among the four models (\u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(H(3)=2.45,\\ p=0.48\\)\u003c/span\u003e\u003c/span\u003e), suggesting that despite architectural differences, all models share similar limitations in processing complex gynecological vignettes without reasoning steps.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eModule-Specific Performance Variations\u003c/h2\u003e \u003cp\u003eGranular analysis revealed drastic inconsistencies (Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cul\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eOutlier Performance\u003c/b\u003e: In \u003cem\u003eCommon Benign Gynecologic Conditions\u003c/em\u003e, DeepSeek-V3.2 achieved 70.00% accuracy, significantly outperforming Claude-4.5-Opus (0.00%, p\u0026thinsp;\u0026lt;\u0026thinsp;0.01).\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eThe \"Infertility\" Blind Spot\u003c/b\u003e: Near-universal failure was observed in \u003cem\u003eInfertility \u0026amp; Assisted Reproduction\u003c/em\u003e, where no model exceeded 20.00% and ChatGPT-5 scored 0.00%. 
Questions regarding ART protocols and hormonal ratios were consistently missed.\u003c/p\u003e \u003c/li\u003e \u003cli\u003e \u003cp\u003e \u003cb\u003eInconsistency\u003c/b\u003e: Gemini-3 led in \u003cem\u003eOncology\u003c/em\u003e (50.00%) but failed in \u003cem\u003eMenopause\u003c/em\u003e (10.00%).\u003c/p\u003e \u003c/li\u003e \u003c/ul\u003e \u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eComparison of Answer Accuracy Rates of AI Chatbots Across Gynecological Modules\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGynecological Modules\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTotal Questions\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eChatGPT-5\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGemini-3\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eDeepSeek-V3.2\u003c/p\u003e \u003c/th\u003e 
\u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eClaude-4.5-Opus\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eFemale Reproductive Anatomy \u0026amp; Development\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2/10 (20.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3/10 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4/10 (40.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e4/10 (40.00%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMenstrual Physiology \u0026amp; Endocrine Regulation\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3/10 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e5/10 (50.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e4/10 (40.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e2/10 (20.00%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCommon Benign Gynecologic Conditions\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3/10 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e5/10 (50.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e7/10 (70.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0/10 (0.00%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGynecologic Cancer Screening \u0026amp; Management\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e2/10 (20.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e5/10 (50.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1/10 (10.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e3/10 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eInfertility \u0026amp; Assisted Reproduction Basics\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0/10 (0.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1/10 (10.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1/10 (10.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e2/10 (20.00%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eContraception \u0026amp; Family Planning\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e 
\u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e5/10 (50.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e3/10 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e1/10 (10.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e2/10 (20.00%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eMenopause \u0026amp; Hormone Replacement Therapy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e3/10 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e1/10 (10.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e3/10 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e2/10 (20.00%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eOverall Statistics\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e18/70 (25.71%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e23/70 (32.86%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e21/70 (30.00%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e15/70 (21.43%)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e 
\u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eRelationship with Item Difficulty\u003c/h2\u003e \u003cp\u003eNo significant correlation was found between item difficulty (Pj) and AI accuracy for any model (e.g., ChatGPT-5: r\u0026thinsp;=\u0026thinsp;0.023, p\u0026thinsp;=\u0026thinsp;0.85). This indicates that AI chatbots miss \"easy\" questions just as often as \"hard\" ones (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e).\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eEvaluation of the Impact of Item Difficulty Index on AI Chatbot Responses\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eIndicators\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSample Size (N)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eMean\u003c/p\u003e \u003c/th\u003e \u003cth 
align=\"left\" colname=\"c4\"\u003e \u003cp\u003eStandard Deviation (SD)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCorrelation Coefficient with Difficulty Index (Pj)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eP-value\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eItem Difficulty Index (Pj)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.27\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.198\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eChatGPT-5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.441\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.023\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.85\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini-3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e0.33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.472\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.018\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.88\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek-V3.2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.30\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.461\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e-0.056\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.65\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude-4.5-Opus\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e70\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.21\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.411\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.037\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.76\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003ch2\u003eQualitative Analysis of Error Patterns (Case Studies)\u003c/h2\u003e \u003cp\u003eTo understand \u003cem\u003ewhy\u003c/em\u003e the models failed, we conducted a qualitative content analysis of the incorrect responses based on the clinical vignettes provided. This revealed distinct failure modes where the AI prioritized probabilistic word association over clinical logic. Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e highlights three representative cases where models demonstrated high confidence but incorrect reasoning.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eQualitative Case Study of AI Errors in Gynecological Clinical Reasoning\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eCase ID\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eModule\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eClinical Vignette Summary\u003c/p\u003e \u003c/th\u003e \u003cth 
align=\"left\" colname=\"c4\"\u003e \u003cp\u003eCorrect Diagnosis/Mechanism\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eCommon AI Error \u0026amp; Analysis\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCase A\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eEndocrine\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePatient with primary amenorrhea,\u0026nbsp;\u003cb\u003ehypertension\u003c/b\u003e, and hypokalemia. 46,XX karyotype.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e17-alpha-hydroxylase deficiency\u003c/b\u003e\u0026nbsp;(Accumulation of mineralocorticoids causes HTN).\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003eAI Error\u003c/b\u003e:\u0026nbsp;21-hydroxylase deficiency.\u003cbr\u003e\u003cb\u003eAnalysis\u003c/b\u003e:\u0026nbsp;\u003cem\u003eSemantic Association Bias.\u003c/em\u003e\u0026nbsp;Models ignored the \"hypertension\" sign and selected the most statistically common form of CAH (21-OH), which actually causes hypotension.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCase B\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eAnatomy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eIdentification of the structure most at risk of injury during hysterectomy near the\u0026nbsp;\u003cb\u003euterine artery\u003c/b\u003e.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003eUreter\u003c/b\u003e\u0026nbsp;(\"Water under the 
bridge\").\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003eAI Error\u003c/b\u003e:\u0026nbsp;Internal Iliac Artery or Cardinal Ligament.\u003cbr\u003e\u003cb\u003eAnalysis\u003c/b\u003e:\u0026nbsp;\u003cem\u003eSpatial Blindness.\u003c/em\u003e\u0026nbsp;Models failed to visualize the 3D anatomical relationship where the ureter passes inferior to the uterine artery.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCase C\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eDevelopment\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e16-year-old, primary amenorrhea, absent uterus,\u0026nbsp;\u003cb\u003e46,XY karyotype\u003c/b\u003e.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003eAndrogen Insensitivity Syndrome (AIS)\u003c/b\u003e.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003eAI Error\u003c/b\u003e:\u0026nbsp;Müllerian Agenesis (MRKH).\u003cbr\u003e\u003cb\u003eAnalysis\u003c/b\u003e:\u0026nbsp;\u003cem\u003eHierarchical Logic Failure.\u003c/em\u003e\u0026nbsp;Models correctly identified the anatomical defect (absent uterus) but failed to prioritize the genetic evidence (XY), confusing it with the XX-based MRKH syndrome.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cdiv id=\"Sec19\" class=\"Section2\"\u003e \u003ch2\u003eThe \"Intelligence Gap\" in Specialized Medicine\u003c/h2\u003e \u003cp\u003eThis pilot study provides a sobering reality check for the enthusiasm surrounding next-generation AI in healthcare. 
Despite the marketing regarding the advanced capabilities of ChatGPT-5 and Gemini-3, their performance in this controlled gynecological assessment (21\u0026ndash;33% accuracy) was surprisingly poor. To put this in perspective, the passing threshold for USMLE examinations is typically around 60%. By this metric, none of the evaluated models would be considered \"safe\" or \"competent\" to practice gynecology under zero-shot conditions. This finding contradicts some previous reports of high GPT-4 performance, highlighting that \u003cb\u003especialized domains require deeper validation than general medicine benchmarks.\u003c/b\u003e\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003eAnalyzing the Failure Modes: Why Did They Fail?\u003c/h2\u003e \u003cp\u003eOur qualitative analysis (Table\u0026nbsp;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e) provides critical insights into the cognitive architecture of these failures.\u003c/p\u003e \u003cp\u003e \u003cb\u003e1. Semantic Association Bias (The \"Probabilistic Trap\")\u003c/b\u003e \u003c/p\u003e \u003cp\u003eIn Case A (Endocrine), the AI models exhibited what we term \"Semantic Association Bias.\" When presented with a vignette about \"Congenital Adrenal Hyperplasia\" (CAH), the models gravitated towards \"21-hydroxylase deficiency\" because it is the most common subtype mentioned in medical literature. However, the key clinical discriminator\u0026mdash;\u003cb\u003ehypertension\u003c/b\u003e\u0026mdash;was ignored. In a clinical setting, this is dangerous. Treating a hypertensive newborn for a salt-wasting condition (21-OH deficiency) based on an AI suggestion could be fatal. This confirms that current LLMs operate as \"Pattern Matchers\" rather than \"Physiological Simulators.\"\u003c/p\u003e \u003cp\u003e \u003cb\u003e2. 
Spatial Blindness in Text-Based Models\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe failure in Case B (Anatomy) highlights a limitation of text-only training. The relationship between the ureter and the uterine artery is a visual, 3D concept (\"water under the bridge\"). While the AI likely \"read\" this phrase in its training data, it failed to apply the spatial logic to a surgical scenario involving ligament clamping. This suggests that for surgical education, text-based chatbots must be supplemented with multimodal inputs (images/3D models) to be reliable.\u003c/p\u003e \u003cp\u003e \u003cb\u003e3. Hierarchical Logic Failure\u003c/b\u003e \u003c/p\u003e \u003cp\u003eIn Case C (AIS vs. MRKH), the models failed to apply the correct diagnostic hierarchy. A human clinician knows that karyotype (XY vs. XX) is the definitive discriminator for uterine agenesis. The AI, however, seemed to weight the symptom description \"absent uterus\" more heavily than the genetic data, leading to a misdiagnosis.\u003c/p\u003e \u003cdiv id=\"Sec21\" class=\"Section3\"\u003e \u003ch2\u003eThe Critical Role of Prompt Engineering\u003c/h2\u003e \u003cp\u003eThe low scores are likely exacerbated by the \u003cb\u003eZero-Shot Constrained Prompt\u003c/b\u003e (\"Do not explain...\"). Previous research [\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e] has shown that \"Chain-of-Thought\" (CoT) prompting can significantly improve performance. However, our negative result is scientifically valuable because it highlights a dangerous user behavior. In real-world settings, stressed medical students or busy clinicians often ask quick questions expecting quick answers. Our data suggest that \u003cb\u003ewithout the \"crutch\" of self-explanation, the AI's raw answer retrieval is unreliable.\u003c/b\u003e This implies that the AI does not truly \"know\" the medical concept; it reconstructs it. 
When the reconstruction process (reasoning) is blocked, the probability model collapses.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eEducational Implications: The \"Stochastic Parrot\" Risk\u003c/h2\u003e \u003cp\u003eA particularly concerning finding is the lack of correlation between item difficulty and AI performance (Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e). In educational psychology, a competent learner shows a predictable pattern: getting easy questions right and struggling with hard ones. The AI models, however, exhibited \"stochastic\" behavior [\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e]. They might correctly identify a rare tumor marker but fail a basic contraceptive question. This unpredictability makes them dangerous educational tools for novices, who may gain false confidence from the AI's correct answers to complex questions, not realizing it has hallucinated on basic principles.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003eLimitations\u003c/h2\u003e \u003cp\u003eThis study has limitations. First, the sample size (N\u0026thinsp;=\u0026thinsp;70) is relatively small, though sufficient for a pilot study to identify qualitative patterns. Second, the text-only format excludes visual diagnostic skills. Third, the \"Zero-Shot\" constraint represents a \"stress test\" scenario. 
Finally, the training data of these models are not publicly available, preventing direct inspection of how specific pathologies such as infertility are covered.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eThis comparative pilot study shows that under constrained zero-shot conditions, \u003cstrong\u003enext-generation AI chatbots (ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus) demonstrate unsatisfactory performance in gynecological clinical reasoning\u003c/strong\u003e, with accuracy rates falling well below human passing standards.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eKey Takeaways:\u003c/strong\u003e\u003c/p\u003e\n\u003col start=\"1\" type=\"1\"\u003e\n \u003cli\u003e\u003cstrong\u003eUnreliability:\u003c/strong\u003e Current models cannot be trusted for unsupervised information retrieval in gynecology, particularly for differentiating complex endocrine disorders (e.g., CAH subtypes).\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eMechanism of Failure:\u003c/strong\u003e Errors are often driven by \"probabilistic bias,\" where the AI selects the most common disease association rather than analyzing specific clinical signs (e.g., hypertension).\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eSafety Risk:\u003c/strong\u003e The lack of correlation between difficulty and accuracy means AI failure is unpredictable.\u003c/li\u003e\n\u003c/ol\u003e\n\u003cp\u003eWe recommend that medical educators treat these tools as \u003cstrong\u003e\"preliminary search engines\"\u003c/strong\u003e rather than \u003cstrong\u003e\"digital professors.\"\u003c/strong\u003e Future implementation in curricula must mandate \"Chain-of-Thought\" prompting strategies to mitigate the risks of semantic bias and hallucination.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eData availability\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors do not have permission to share 
data.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting interest\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNone.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNone.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData sharing statement\u0026nbsp;\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003e\u003cstrong\u003eNori H, King N, McKinney SM, et al.\u003c/strong\u003e Capabilities of GPT-4 on Medical Challenge Problems. \u003cem\u003eN Engl J Med AI\u003c/em\u003e. 2024;1(1):AIoa2300038.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eThirunavukarasu AJ, Ting DSJ, Elangovan K, et al.\u003c/strong\u003e Large language models in medicine. \u003cem\u003eNat Med\u003c/em\u003e. 2023;29:1930\u0026ndash;1940.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eKung TH, Cheatham M, Medenilla A, et al.\u003c/strong\u003e Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. \u003cem\u003ePLOS Digit Health\u003c/em\u003e. 2023;2(2):e0000198.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eGilson A, Safranek CW, Huang T, et al.\u003c/strong\u003e How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment. \u003cem\u003eJMIR Med Educ\u003c/em\u003e. 2023;9:e45312.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eSinghal K, Azizi S, Tu T, et al.\u003c/strong\u003e Large language models encode clinical knowledge. \u003cem\u003eNature\u003c/em\u003e. 
2023;620:172\u0026ndash;180.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eStrong E, DiGiammarino A, Weng Y, et al.\u003c/strong\u003e Chatbot vs Medical Student Performance on Free-Response Clinical Reasoning Examinations. \u003cem\u003eJAMA Intern Med\u003c/em\u003e. 2023;183(9):1028\u0026ndash;1030.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eMesk\u0026oacute; B, Topol EJ.\u003c/strong\u003e The imperative for regulatory oversight of large language models (or generative AI) in healthcare. \u003cem\u003eLancet Digit Health\u003c/em\u003e. 2023;5(12):e858-e860.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eLi H, Moon JT, Purkayastha S, et al.\u003c/strong\u003e Ethics of large language models in medicine and medical education. \u003cem\u003eLancet Digit Health\u003c/em\u003e. 2023;5(6):e333-e335.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eDave T, Athaluri SA, Singh S.\u003c/strong\u003e ChatGPT in medicine: an overview of its applications, advantages, limitations, future prospects, and ethical considerations. \u003cem\u003eFront Artif Intell\u003c/em\u003e. 2023;6:1169595.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eLee P, Bubeck S, Petro J.\u003c/strong\u003e Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. \u003cem\u003eN Engl J Med\u003c/em\u003e. 2023;388:1233-1235.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eBender EM, Gebru T, McMillan-Major A, Shmitchell S.\u003c/strong\u003e On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? \u003cem\u003eFAccT \u0026apos;21\u003c/em\u003e. 2021:610-623.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eHaupt CE, Marks M.\u003c/strong\u003e AI-Generated Medical Advice\u0026mdash;GPT and Beyond. \u003cem\u003eJAMA\u003c/em\u003e. 
2023;329(16):1349\u0026ndash;1350.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eWong RS, Ming LC, Dang CP, et al.\u003c/strong\u003e Medical student perceptions of the application of artificial intelligence in medical education: a systematic review. \u003cem\u003eJ Med Educ Curric Dev\u003c/em\u003e. 2024;11:23821205241226342.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eClusmann J, Kolb-Lagen B, Drutschke D, et al.\u003c/strong\u003e The future landscape of large language models in medicine. \u003cem\u003eCommun Med (Lond)\u003c/em\u003e. 2023;3:141.\u003c/li\u003e\n \u003cli\u003e\u003cstrong\u003eDe Angelis L, Baglivo F, Arzilli G, et al.\u003c/strong\u003e ChatGPT and the rise of large language models: the new AI-driven infodemic threat in academia. \u003cem\u003eBMC Res Notes\u003c/em\u003e. 2023;16:294.\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"archives-of-gynecology-and-obstetrics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"arch","sideBox":"Learn more about [Archives of Gynecology and Obstetrics](https://www.springer.com/journal/404)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/arch/default.aspx","title":"Archives of Gynecology and Obstetrics","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false},"keywords":"Gynecology, Large Language Models, Medical Education, Clinical Reasoning, 
Hallucination, Pilot Study","lastPublishedDoi":"10.21203/rs.3.rs-8264890/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8264890/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003ePurpose\u003c/h2\u003e \u003cp\u003eAs artificial intelligence (AI) models evolve into their next generations, their application in specialized medical fields requires rigorous validation. While large language models (LLMs) have shown promise in general medicine, their reliability in complex gynecological clinical reasoning remains under-explored. This pilot study aimed to comparatively assess the knowledge retention, safety, and reasoning limitations of advanced AI chatbots in gynecology using a constrained zero-shot multiple-choice question (MCQ) format.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eA total of 70 text-based MCQs covering seven core gynecological modules were adapted from USMLE Step 2 CK standards. The questions were administered to four advanced AI models: ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus. To simulate a rapid-retrieval clinical scenario, models were tested under \"zero-shot\" conditions with a constrained prompt prohibiting reasoning steps. We performed both quantitative statistical analysis (Kruskal\u0026ndash;Wallis, Cochran\u0026rsquo;s Q) and qualitative error analysis to identify specific failure modes.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eContrary to expectations for advanced models, overall accuracy was unsatisfactory: Gemini-3 (32.86%), DeepSeek-V3.2 (30.00%), ChatGPT-5 (25.71%), and Claude-4.5-Opus (21.43%). Significant performance disparities were observed across modules. Notably, ChatGPT-5 scored 0.00% in \u003cem\u003eInfertility\u003c/em\u003e, while DeepSeek-V3.2 reached 70.00% in \u003cem\u003eCommon Benign Conditions\u003c/em\u003e. 
Qualitative analysis revealed three critical failure patterns: (1) Semantic Association Bias (confusing high-probability diseases with symptom-specific diagnoses), (2) Spatial Anatomy Confusion, and (3) Genetic Logic Reversal. No significant correlation was found between item difficulty and accuracy (p\u0026thinsp;\u0026gt;\u0026thinsp;0.05).\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eUnder constrained non-reasoning prompts, even next-generation AI chatbots demonstrate unsatisfactory performance in gynecology. The qualitative analysis suggests that models often rely on probabilistic keyword matching rather than physiological simulation, leading to dangerous clinical errors (e.g., misdiagnosing adrenal enzymes). While potential exists, current reliability is insufficient for unsupervised use in gynecological education. These findings highlight the critical need for \"Chain-of-Thought\" prompting and human expert oversight.\u003c/p\u003e","manuscriptTitle":"Performance of Next-Generation AI Chatbots in Gynecological Knowledge Assessment: A Comparative Pilot Study of ChatGPT-5, Gemini-3, DeepSeek-V3.2, and Claude-4.5-Opus","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-12-16 12:53:05","doi":"10.21203/rs.3.rs-8264890/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision 
requested","date":"2026-01-07T16:29:59+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-01-06T04:05:14+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2026-01-04T20:16:01+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"60257320518576070824439417848383128331","date":"2025-12-23T12:57:11+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"242690604414302763528033671733946098321","date":"2025-12-23T12:31:27+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"52586378157832336779981552714102578777","date":"2025-12-21T19:07:40+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-12-10T16:01:00+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-12-09T09:15:10+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-12-08T11:16:49+00:00","index":"","fulltext":""},{"type":"submitted","content":"Archives of Gynecology and Obstetrics","date":"2025-12-03T01:46:24+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"archives-of-gynecology-and-obstetrics","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"arch","sideBox":"Learn more about [Archives of Gynecology and Obstetrics](https://www.springer.com/journal/404)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/arch/default.aspx","title":"Archives of Gynecology and Obstetrics","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"em","reportingPortfolio":"Springer Hybrid","inReviewEnabled":true,"inReviewRevisionsEnabled":false}}],"origin":"","ownerIdentity":"b9e3fe5e-def0-413f-a9aa-fd4ad2d8f648","owner":[],"postedDate":"December 16th, 
2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"published-in-journal","subjectAreas":[],"tags":[],"updatedAt":"2026-03-09T16:12:24+00:00","versionOfRecord":{"articleIdentity":"rs-8264890","link":"https://doi.org/10.1007/s00404-026-08358-7","journal":{"identity":"archives-of-gynecology-and-obstetrics","isVorOnly":false,"title":"Archives of Gynecology and Obstetrics"},"publishedOn":"2026-03-03 15:58:48","publishedOnDateReadable":"March 3rd, 2026"},"versionCreatedAt":"2025-12-16 12:53:05","video":"","vorDoi":"10.1007/s00404-026-08358-7","vorDoiUrl":"https://doi.org/10.1007/s00404-026-08358-7","workflowStages":[]},"version":"v1","identity":"rs-8264890","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8264890","identity":"rs-8264890","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
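The Limitations section notes that the sample (N = 70) is small. A rough way to see what that implies for the reported accuracies is to attach a confidence interval to each one. The sketch below is an illustration only and is not part of the authors' analysis (the abstract reports Kruskal-Wallis and Cochran's Q tests). The correct-answer counts are not stated in the text; they are back-calculated here from the reported percentages and N = 70, and the names `CORRECT` and `wilson_interval` are hypothetical.

```python
import math

# Assumption: counts are inferred from the reported percentages with N = 70;
# the paper itself reports only the percentages.
N_ITEMS = 70
CORRECT = {
    "Gemini-3": 23,         # reported 32.86%
    "DeepSeek-V3.2": 21,    # reported 30.00%
    "ChatGPT-5": 18,        # reported 25.71%
    "Claude-4.5-Opus": 15,  # reported 21.43%
}


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


# One (lower, upper) pair per model, as proportions in [0, 1].
INTERVALS = {model: wilson_interval(k, N_ITEMS) for model, k in CORRECT.items()}

if __name__ == "__main__":
    for model, k in CORRECT.items():
        lo, hi = INTERVALS[model]
        print(f"{model}: {100 * k / N_ITEMS:.2f}% "
              f"(approx. 95% CI {100 * lo:.1f} to {100 * hi:.1f}%)")
```

With 70 items each interval spans roughly ten percentage points on either side of the point estimate, and the intervals for the highest- and lowest-scoring models overlap. Overlapping intervals are not a formal test, but they are consistent with treating the between-model ranking as provisional in a pilot of this size.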

