Diagnostic Accuracy of Large Language Models Versus Clinicians in Severe Preeclampsia: A Cross-Sectional Study

Research Article

Diah Putri, Ferry Achmad Firdaus, Akhmad Yogi Pramatirta¹

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9244478/v1
This work is licensed under a CC BY 4.0 License
Status: Under Review
Version 1

Abstract

Background
Preeclampsia remains one of the most significant causes of maternal and perinatal morbidity and mortality. Multiorgan dysfunction in severe preeclampsia requires early and accurate assessment. Artificial intelligence (AI) in medicine, and large language models (LLMs) in particular, could improve diagnostic precision, but validation in obstetric emergencies is scarce.

Objective
To establish the concordance between clinical judgment and three LLMs (ChatGPT, DeepSeek, and Gemini) in the diagnosis of severe preeclampsia using standardized clinical case data.

Methods
A cross-sectional analytic study was performed on 133 de-identified suspected cases of severe preeclampsia. Each case was diagnosed independently by clinicians and then by the three LLMs. Agreement was estimated using Cohen’s Kappa, and differences in paired diagnostic proportions were evaluated with McNemar’s test.

Results
ChatGPT showed moderate agreement with clinicians (Kappa = 0.593, p < 0.001), indicating statistically significant diagnostic concordance. DeepSeek showed very low agreement (Kappa = 0.178, p = 0.037), while Gemini showed negative agreement (Kappa = −0.240, p = 0.006), suggesting systematic disagreement. McNemar’s test found no statistically significant differences in diagnosis between clinicians and any of the LLMs (ChatGPT p = 0.122; DeepSeek p = 0.105; Gemini p = 0.824), indicating that the overall proportions of positive diagnoses were similar even though case-level agreement differed widely. ChatGPT also had the highest sensitivity (76.6%) and specificity (83.9%).

Conclusion
Of the evaluated LLMs, only ChatGPT diagnosed severe preeclampsia in alignment with clinical evaluations, supporting its potential role as a clinical decision support system. The lack of concordance shown by DeepSeek and Gemini indicates that these models need further refinement and validation before use in obstetrics.

Keywords: AI, Large Language Models, Pre-eclampsia, Diagnostic Agreement, Clinical Decision Support

INTRODUCTION
Preeclampsia is a hypertensive disorder of pregnancy characterized by new-onset hypertension and multi-organ dysfunction occurring after 20 weeks of gestation. It remains one of the leading causes of maternal and perinatal morbidity and mortality worldwide, particularly in low- and middle-income countries.
[1, 2] The severe form of preeclampsia poses a major global health challenge and contributes significantly to maternal deaths, preterm births, and perinatal complications. In low-resource settings, where access to specialist care, laboratory facilities, and advanced monitoring is often limited, the burden of severe preeclampsia is disproportionately higher. [3, 4]

Severe preeclampsia is a complex condition affecting multiple organ systems that can quickly deteriorate into eclampsia, HELLP syndrome, kidney dysfunction, cerebral hemorrhage, or other serious and potentially fatal complications if not detected and managed promptly. [5, 6] Accurate and timely diagnosis is therefore essential to ensure optimal maternal and neonatal outcomes. However, diagnostic accuracy often varies among clinicians due to overlapping clinical manifestations with other hypertensive disorders of pregnancy, such as gestational hypertension or chronic hypertension with superimposed preeclampsia. [7] Differences in clinical experience, subjective interpretation of symptoms, and limited access to confirmatory laboratory testing further contribute to variability in diagnosis, especially in peripheral or resource-constrained healthcare environments. [3, 4, 6, 7]

The recent advancements in artificial intelligence (AI) and computational medicine have introduced new possibilities for enhancing clinical decision-making. Among these, large language models (LLMs) represent one of the most transformative innovations in the field of digital health. LLMs such as ChatGPT, DeepSeek, and Gemini are designed to process and interpret large volumes of textual data, integrating clinical information and medical knowledge to produce structured reasoning and diagnostic outputs. By leveraging natural language processing and deep learning, LLMs have demonstrated growing potential to assist clinicians in generating differential diagnoses, interpreting clinical notes, and improving diagnostic consistency.
Their adaptive learning capabilities make them especially promising for conditions requiring multi-factorial assessment, such as preeclampsia, where various physiological, biochemical, and clinical indicators must be synthesized into a coherent diagnosis. [8–11]

While their potential is evident, there is limited evidence comparing the diagnostic performance of LLMs with human clinicians, particularly in obstetric emergencies where timely and accurate decisions are critical to maternal and fetal survival. [8, 9, 11] Most prior evaluations of LLMs in healthcare have focused on general internal medicine or non-acute scenarios, leaving significant knowledge gaps in their application to high-risk obstetric conditions. [8–10] Furthermore, no studies to date have been conducted in Indonesia assessing how LLMs perform in diagnosing severe preeclampsia, a leading cause of maternal morbidity in the country. [1–4] This gap underscores the urgent need for context-specific validation of AI systems before integration into clinical workflows. [8, 9, 11]

This study was designed to evaluate the level of diagnostic concordance between clinicians and three large language models (ChatGPT, DeepSeek, and Gemini) in identifying severe preeclampsia cases. The results of this research are expected to enrich the growing body of knowledge on clinical AI validation and emphasize the potential contribution of LLMs as complementary decision-support tools for healthcare providers managing complex maternal health conditions across various clinical settings.

METHODS
A cross-sectional analytic study was conducted using 133 de-identified patient records from RSUD Al Ihsan, Bandung, collected between January and December 2024. The study population consisted of pregnant women with a gestational age of 20 weeks or more who were clinically diagnosed with severe preeclampsia based on the 2020 ACOG criteria.
[1] Inclusion criteria included complete medical records with documented blood pressure, proteinuria, and laboratory data assessing target organ function. Patients with comorbidities that could confound the diagnosis (e.g., chronic kidney disease, lupus, or uncontrolled chronic hypertension) or incomplete records were excluded.

Each case was reviewed independently by a clinical panel and subsequently analyzed by three LLMs (ChatGPT, DeepSeek, and Gemini) using standardized prompts derived from the ACOG diagnostic framework. The LLMs were asked to interpret clinical data and classify cases as either severe preeclampsia or non-severe. Diagnostic agreement was evaluated using Cohen’s Kappa coefficient, with interpretation following the Landis and Koch criteria (0.41–0.60 moderate; 0.61–0.80 substantial; 0.81–1.00 almost perfect). McNemar’s test was employed to assess significant differences in paired proportions. Sensitivity and specificity were calculated to evaluate diagnostic performance. Statistical significance was set at p < 0.05.

Model Interaction and Clinical Input Procedure
To evaluate the diagnostic capabilities of each Large Language Model (LLM), a zero-shot prompting approach was used. Each clinical case was presented to the three models (ChatGPT, DeepSeek, and Gemini) using a systematic input sequence:

1. Role Assignment: The models were instructed to function as a specialist in obstetrics and gynecology.
2. Data Presentation: Physiological information for each patient was provided in a structured format, including systolic and diastolic blood pressure, proteinuria status, and laboratory findings such as platelet counts, serum creatinine, and liver transaminase levels.
3. Diagnostic Task: The models were required to classify the case as either "severe preeclampsia" or "non-severe" according to the established diagnostic criteria.
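
The three-step input sequence above can be sketched as a small prompt builder. This is a minimal illustration only: the study does not publish its prompt text, so the wording of the prompt, the `CaseRecord` fields and their units, and the `build_prompt` and `parse_label` helpers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CaseRecord:
    """Hypothetical container for one de-identified case (field names and units are assumptions)."""
    systolic_mmhg: int
    diastolic_mmhg: int
    proteinuria: str        # e.g. a dipstick result such as "+3"
    platelets_per_ul: int
    creatinine_mg_dl: float
    ast_u_l: int
    alt_u_l: int


def build_prompt(case: CaseRecord) -> str:
    """Assemble a zero-shot prompt: role assignment, structured data, then the task."""
    role = "You are a specialist in obstetrics and gynecology."
    data = "\n".join([
        f"Blood pressure: {case.systolic_mmhg}/{case.diastolic_mmhg} mmHg",
        f"Proteinuria: {case.proteinuria}",
        f"Platelet count: {case.platelets_per_ul} /uL",
        f"Serum creatinine: {case.creatinine_mg_dl} mg/dL",
        f"AST: {case.ast_u_l} U/L, ALT: {case.alt_u_l} U/L",
    ])
    task = (
        'Classify this case as either "severe preeclampsia" or "non-severe" '
        "according to the ACOG diagnostic criteria. Answer with the label only."
    )
    return f"{role}\n\nPatient data:\n{data}\n\n{task}"


def parse_label(reply: str) -> bool:
    """Map a free-text model reply to True (severe) or False (non-severe)."""
    text = reply.strip().lower()
    # Check "non-severe" first: "non-severe preeclampsia" also contains "severe preeclampsia".
    if "non-severe" in text:
        return False
    if "severe preeclampsia" in text:
        return True
    raise ValueError(f"Unrecognized label in reply: {reply!r}")
```

Keeping the prompt text fixed and varying only the patient data is what makes the comparison across models standardized; no worked examples are included in the prompt, which is what "zero-shot" refers to.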

RESULTS
A total of 133 patient records met the inclusion criteria and were analyzed to determine the diagnostic concordance between clinicians and three LLMs. Table 1 presents the comparison of agreement levels using Cohen’s Kappa and McNemar’s test. ChatGPT demonstrated the highest degree of concordance with clinical diagnoses, showing a moderate agreement strength (Kappa = 0.593, p < 0.001) and no statistically significant difference according to McNemar’s test (p = 0.122). DeepSeek exhibited low agreement (Kappa = 0.178, p = 0.037) but similarly showed no significant difference (p = 0.105). In contrast, Gemini displayed a negative agreement (Kappa = −0.240, p = 0.006), indicating systematic disagreement, although McNemar’s test likewise revealed no significant difference (p = 0.824). These findings suggest that while ChatGPT aligns relatively well with clinician judgment, the other two LLMs demonstrate inconsistent diagnostic alignment.

Table 1. Diagnostic Agreement between Clinicians and Large Language Models

Model      Cohen’s Kappa   p-value (Kappa)   McNemar test p-value   Interpretation
ChatGPT    0.593           < 0.001           0.122                  Moderate agreement, no significant difference
DeepSeek   0.178           0.037             0.105                  Low agreement, no significant difference
Gemini     −0.240          0.006             0.824                  Negative agreement, no significant difference

Furthermore, Table 2 summarizes the sensitivity and specificity results for each model. ChatGPT achieved the best diagnostic accuracy, with a sensitivity of 76.6% and a specificity of 83.9%, reflecting good performance in correctly identifying both true positive and true negative cases. DeepSeek showed moderate accuracy, with sensitivity and specificity of 55.8% and 62.5%, respectively. Conversely, Gemini recorded poor diagnostic reliability, with a sensitivity of 45.5% and specificity of 30.4%, indicating frequent misclassification of both positive and negative cases.
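
The agreement and accuracy measures reported in this section all derive from a 2×2 table of clinician versus model diagnoses. The sketch below shows the arithmetic using only the Python standard library. The counts are illustrative assumptions, not data from the study: the paper does not report its confusion matrices, and the first set of values was simply chosen so that it reproduces the sensitivity and specificity reported for ChatGPT. The exact (binomial) form of McNemar’s test is likewise an assumption, since the source does not say which variant was used.

```python
from math import comb


def sensitivity_specificity(tp: int, fn: int, fp: int, tn: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP)."""
    return tp / (tp + fn), tn / (tn + fp)


def cohens_kappa(tp: int, fn: int, fp: int, tn: int) -> float:
    """Chance-corrected agreement between clinician and model labels."""
    n = tp + fn + fp + tn
    observed = (tp + tn) / n
    # Agreement expected by chance, from the marginal totals of each rater.
    expected = ((tp + fn) * (tp + fp) + (fp + tn) * (fn + tn)) / n**2
    return (observed - expected) / (1 - expected)


def mcnemar_exact_p(fn: int, fp: int) -> float:
    """Two-sided exact McNemar p-value; uses only the discordant pairs."""
    n = fn + fp
    if n == 0:
        return 1.0
    k = min(fn, fp)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Illustrative counts (assumed, not from the paper): clinician diagnosis is the reference.
moderate = dict(tp=59, fn=18, fp=9, tn=47)    # mostly concordant
negative = dict(tp=35, fn=42, fp=39, tn=17)   # mostly discordant, but balanced

for name, t in (("moderate", moderate), ("negative", negative)):
    sens, spec = sensitivity_specificity(**t)
    kappa = cohens_kappa(**t)
    p = mcnemar_exact_p(t["fn"], t["fp"])
    print(f"{name}: sens={sens:.3f} spec={spec:.3f} kappa={kappa:.3f} McNemar p={p:.3f}")
```

With the first set of counts the sketch gives a sensitivity of about 0.766, a specificity of about 0.839, a kappa of about 0.593, and a McNemar p-value of about 0.122. The second set illustrates why McNemar’s test can be non-significant while kappa is negative: the test compares only the two kinds of discordant pairs, so when false negatives and false positives are nearly balanced the p-value is large regardless of how many cases disagree.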

Overall, these results highlight ChatGPT’s relative consistency with clinician diagnoses compared to DeepSeek and Gemini, which require further refinement for reliable clinical application.

Table 2. Diagnostic Performance of Large Language Models in Identifying Severe Preeclampsia

Model      Sensitivity (Recall)   Specificity   Interpretation
ChatGPT    76.6%                  83.9%         Good at identifying both cases and non-cases
DeepSeek   55.8%                  62.5%         Moderate accuracy in identifying cases
Gemini     45.5%                  30.4%         Low performance, with many false positives/negatives

DISCUSSION
Among the three models tested, ChatGPT showed moderate but statistically significant agreement with clinician diagnoses, suggesting it may serve as a useful adjunct to clinical judgment in obstetric diagnostic workflows. Recent research has shown that LLMs can assist physicians in generating differential diagnoses, interpreting clinical findings, and improving consistency in decision-making. [8–12] Studies comparing LLM performance with clinicians have demonstrated that ChatGPT and other models can achieve diagnostic accuracies approaching human levels in non-acute clinical contexts, although performance may decline in complex or emergency situations. [11–13] The moderate concordance found in this study likely reflects ChatGPT’s broad training across diverse medical corpora, which enables generalized reasoning even in the absence of detailed, case-specific obstetric datasets. [8, 10, 13]

In contrast, DeepSeek and Gemini demonstrated lower diagnostic reliability, possibly due to limited obstetric data representation or insufficient domain-specific fine-tuning. Prior analyses have shown that model performance varies widely depending on the availability of structured medical information, prompting, and contextual language alignment. The negative agreement seen in Gemini indicates a higher likelihood of systematic misclassification when LLMs are used without tailored adaptation or dataset calibration. [9–13]

A study by Elawad et al.
noted that diagnostic inconsistency in preeclampsia arises partly from variability in interpreting criteria and laboratory findings. [6] Likewise, AI-based systems must capture this clinical nuance to produce reliable outputs. [8, 12] Preeclampsia diagnosis often relies not only on objective thresholds but also on clinician assessment of evolving maternal symptoms and hemodynamic patterns, an aspect that can challenge purely algorithmic reasoning. [5, 6, 11, 14]

Several recent publications highlight the potential of AI-driven models for predicting or diagnosing preeclampsia using multimodal data, including laboratory, ultrasound, and hemodynamic inputs. For instance, a 2024 study utilizing electrocardiogram features achieved high predictive performance (AUC 0.85–0.98), underscoring the benefit of integrating multi-source clinical signals into diagnostic models. [14–16] However, these approaches remain largely experimental and have yet to be tested in emergency obstetric contexts in low- and middle-income countries such as Indonesia.

Geographical disparities in AI research further compound this gap. Most validation studies for preeclampsia prediction and AI-based maternal health monitoring have been conducted in North America, China, and Europe, with limited data from Southeast Asia or similar healthcare systems. This highlights the need for region-specific studies to ensure models are trained and validated on populations that reflect local genetic, environmental, and healthcare differences. [3, 4, 7, 14, 16]

Ethical and practical considerations also warrant attention. Integrating LLMs into clinical workflows requires careful monitoring to mitigate algorithmic bias, data privacy risks, and overreliance on AI recommendations. Transparency in model reasoning, continuous performance auditing, and human oversight are essential to prevent diagnostic errors and maintain patient safety.
[9, 10, 12, 17]

In resource-limited environments, LLMs such as ChatGPT may offer valuable support for frontline clinicians by assisting in diagnostic triage, interpretation of medical data, and standardization of care protocols. Nonetheless, their role should remain complementary, not substitutive, to human expertise. Local validation and context-specific calibration are vital prerequisites before widespread clinical adoption. Further development and external validation of AI models are needed to ensure equitable, accurate, and safe integration into maternal healthcare systems. [1, 4, 8, 11, 12, 14, 16]

This study has several limitations, including the use of retrospective de-identified data, the absence of obstetric-specific training for the evaluated LLMs, and a modest single-center sample size that may limit generalizability. Moreover, reliance on textual prompts without multimodal inputs such as laboratory or imaging data may underestimate AI diagnostic potential. Future multicenter, prospective studies with locally adapted model training are needed to improve external validity and clinical applicability.

CONCLUSION
This study found that ChatGPT demonstrated moderate agreement with clinicians (Kappa = 0.593, p < 0.001) and consistent alignment with the clinical diagnosis of severe preeclampsia, highlighting its potential as a Clinical Decision Support System (CDSS). In contrast, DeepSeek and Gemini showed low to negative agreement. These findings indicate that further refinement and domain-specific optimization are needed before their clinical application.

Abbreviations
ACOG: American College of Obstetricians and Gynecologists
AI: Artificial Intelligence
LLM: Large Language Model
RSUD: Rumah Sakit Umum Daerah (Regional General Hospital)

Declarations

Ethics approval and consent to participate
The study was approved by the Health Research Ethics Committee of RSUD Al Ihsan (Reference No. 9892/70/KEPK-RSUD.Al.Ihsan/IV/2025).
Since this study used de-identified retrospective clinical data and involved no direct patient intervention, the requirement for informed consent was waived by the ethics committee. All methods were performed in accordance with the Declaration of Helsinki.

Consent for publication
Not applicable.

Competing interests
The authors declare that they have no competing interests.

Funding
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

Author Contribution
DASP led the study design, performed the primary data curation, conducted the formal analysis, and drafted the initial manuscript. FAF was responsible for the supervision of clinical case validation at RSUD Al Ihsan and provided the necessary institutional resources. AYP contributed to the development of the conceptual framework, provided clinical oversight, and performed the critical review and final editing of the paper. All authors have read and approved the final version of the manuscript for submission.

Acknowledgements
We extend our deepest gratitude to the Department of Obstetrics and Gynecology, Faculty of Medicine, Universitas Padjadjaran and Dr. Hasan Sadikin General Hospital, as well as RSUD Al Ihsan Bandung, for supporting the execution of this study. We thank all clinicians and residents who contributed their expertise in the clinical evaluation of cases.

Data Availability
The datasets analyzed during the current study are not publicly available due to institutional patient privacy policies but are available from the corresponding author on reasonable request.

References
1. American College of Obstetricians and Gynecologists. ACOG Practice Bulletin 222: Gestational Hypertension and Preeclampsia. Obstet Gynecol. 2020;135(6):e237–60.
2. Cunningham FG, Leveno KJ, Bloom SL, Dashe JS, Spong CY, Hoffman BL, et al., editors. Williams obstetrics. 26th edition. New York: McGraw Hill Medical; 2022. (McGraw-Hill’s AccessMedicine).
3. Poon LC, Shennan A, Hyett JA, Kapur A, Hadar E, Divakar H, et al. The International Federation of Gynecology and Obstetrics (FIGO) initiative on pre-eclampsia: A pragmatic guide for first-trimester screening and prevention. Int J Gynecol Obstet. 2019;145(Suppl 1):1–33.
4. Jim B, Karumanchi SA. Preeclampsia: Pathogenesis, prevention, and long-term complications. Semin Nephrol. 2017;37(4):386–97.
5. Staff AC, Sibai BM, Cunningham FG. Prevention of preeclampsia and eclampsia. In: Chesley’s Hypertensive Disorders in Pregnancy. Elsevier; 2015. pp. 253–67.
6. Elawad T, Scott G, Bone JN, Lopez CE, Filippi V, et al. Risk factors for pre-eclampsia in clinical practice guidelines: Comparison with the evidence. BJOG. 2024;131(1):46–62.
7. Huppertz B. Placental origins of preeclampsia: challenging the current hypothesis. Hypertension. 2008;51(4):970–5.
8. Quinn J, Bonaparte J, Kilty S. Postoperative management in the prevention of complications after septoplasty: a systematic review. Laryngoscope. 2013;123(6):1328–33.
9. Chubb J, Cowling P, Reed D. Speeding up to keep up: exploring the use of AI in the research process. AI Soc. 2022;37(4):1439–57.
10. Alhur A. Redefining Healthcare With Artificial Intelligence (AI): The Contributions of ChatGPT, Gemini, and Co-pilot. Cureus. 2024;16(4):e57532.
11. Hassani H, Silva ES. The role of ChatGPT in data science: how AI-assisted conversational interfaces are revolutionizing the field. Big Data Cogn Comput. 2023;7(2):62.
12. Noronha C, Shan S, et al. Evaluating large language models in clinical reasoning: performance and reliability. JAMIA Open. 2025;8(3):ooaf055.
13. Shan S, Zhang Y, et al. Comparative diagnostic accuracy of large language models versus clinicians. Front Digit Health. 2025;7:1480031.
14. Ali Y, Thompson J, et al. Artificial intelligence for hypertensive disorders of pregnancy: a systematic review. Am J Hypertens. 2024;37(2):145–59.
15. Liu X, Zhang P, et al.
Electrocardiogram-based prediction of preeclampsia using AI: a multicenter validation study. BMC Pregnancy Childbirth. 2024;24:1098.
16. Chandra S, Huang Y, et al. Regional disparities and adaptation needs for AI in maternal health. Lancet Digit Health. 2024;6(9):e812–20.
17. Rahman A, Lee D, et al. Ethical challenges and governance frameworks for AI in clinical decision-making. NPJ Digit Med. 2024;7(1):41.

Additional Declarations
No competing interests reported.

Reviewers invited by journal: 23 Apr, 2026
Editor invited by journal: 17 Apr, 2026
Editor assigned by journal: 27 Mar, 2026
Submission checks completed at journal: 27 Mar, 2026
First submitted to journal: 27 Mar, 2026
\u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eChatGPT\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.593\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u0026lt;0.001\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.122\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eModerate agreement, no significant difference\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDeepSeek\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.178\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.037\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.105\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eLow agreement, no significant difference\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGemini\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u0026minus;0.240\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.006\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.824\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNegative agreement, no significant difference\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eFurthermore, Table\u0026nbsp;\u003cspan refid=\"Tab2\" 
class=\"InternalRef\"\u003e2\u003c/span\u003e summarizes the sensitivity and specificity results for each model. ChatGPT achieved the best diagnostic accuracy, with a sensitivity of 76.6% and a specificity of 83.9%, reflecting good performance in correctly identifying both true positive and true negative cases. DeepSeek showed moderate accuracy, with sensitivity and specificity of 55.8% and 62.5%, respectively. Conversely, Gemini recorded poor diagnostic reliability, with a sensitivity of 45.5% and specificity of 30.4%, indicating frequent misclassification of both positive and negative cases. Overall, these results highlight ChatGPT\u0026rsquo;s relative consistency with clinician diagnoses compared to DeepSeek and Gemini, which require further refinement for reliable clinical application.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eDiagnostic Performance of Large Language Models in Identifying Severe Preeclampsia\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"4\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eSensitivity (Recall)\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSpecificity\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" 
colname=\"c4\"\u003e \u003cp\u003eInterpretation\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eChatGPT\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e76.6%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e83.9%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGood at identifying both cases and non-cases\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eDeepSeek\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e55.8%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e62.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eModerate accuracy in identifying cases\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eGemini\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e45.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e30.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003eLow performance, with many false positives/negatives\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e"},{"header":"DISCUSSION","content":"\u003cp\u003e Among the three models tested, ChatGPT showed moderate but statistically significant agreement with clinician diagnoses, suggesting it may serve as a useful adjunct to clinical judgment in obstetric diagnostic workflows. 
Recent research has shown that LLMs can assist physicians in generating differential diagnoses, interpreting clinical findings, and improving consistency in decision-making.\u003csup\u003e\u003cspan additionalcitationids=\"CR9 CR10 CR11\" citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e Studies comparing LLM performance with clinicians have demonstrated that ChatGPT and other models can achieve diagnostic accuracies approaching human levels in non-acute clinical contexts, although performance may decline in complex or emergency situations.\u003csup\u003e\u003cspan additionalcitationids=\"CR12\" citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e The moderate concordance found in this study likely reflects ChatGPT\u0026rsquo;s broad training across diverse medical corpora, which enables generalized reasoning even in the absence of detailed, case-specific obstetric datasets.\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eIn contrast, DeepSeek and Gemini demonstrated lower diagnostic reliability, possibly due to limited obstetric data representation or insufficient domain-specific fine-tuning. Prior analyses have shown that model performance varies widely depending on the availability of structured medical information, prompting, and contextual language alignment. 
The negative agreement seen in Gemini indicates a higher likelihood of systematic misclassification when LLMs are used without tailored adaptation or dataset calibration.\u003csup\u003e\u003cspan additionalcitationids=\"CR10 CR11 CR12\" citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR13\" class=\"CitationRef\"\u003e13\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eA study by Elawad et al. noted that diagnostic inconsistency in preeclampsia arises partly from variability in interpreting criteria and laboratory findings.\u003csup\u003e\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e\u003c/sup\u003e Likewise, AI-based systems must capture this clinical nuance to produce reliable outputs.\u003csup\u003e\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e Preeclampsia diagnosis often relies not only on objective thresholds but also on clinician assessment of evolving maternal symptoms and hemodynamic patterns\u0026mdash;an aspect that can challenge purely algorithmic reasoning.\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e,\u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eSeveral recent publications highlight the potential of AI-driven models for predicting or diagnosing preeclampsia using multimodal data, including laboratory, ultrasound, and hemodynamic inputs. 
For instance, a 2024 study utilizing electrocardiogram features achieved high predictive performance (AUC 0.85\u0026ndash;0.98), underscoring the benefit of integrating multi-source clinical signals into diagnostic models.\u003csup\u003e\u003cspan additionalcitationids=\"CR15\" citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e However, these approaches remain largely experimental and have yet to be tested in emergency obstetric contexts in low- and middle-income countries such as Indonesia. Geographical disparities in AI research further compound this gap. Most validation studies for preeclampsia prediction and AI-based maternal health monitoring have been conducted in North America, China, and Europe, with limited data from Southeast Asia or similar healthcare systems. This highlights the need for region-specific studies to ensure models are trained and validated on populations that reflect local genetic, environmental, and healthcare differences.\u003csup\u003e\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eEthical and practical considerations also warrant attention. Integrating LLMs into clinical workflows requires careful monitoring to mitigate algorithmic bias, data privacy risks, and overreliance on AI recommendations. 
Transparency in model reasoning, continuous performance auditing, and human oversight are essential to prevent diagnostic errors and maintain patient safety.\u003csup\u003e\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e,\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e\u003c/sup\u003e In resource-limited environments, LLMs such as ChatGPT may offer valuable support for frontline clinicians by assisting in diagnostic triage, interpretation of medical data, and standardization of care protocols. Nonetheless, their role should remain complementary, not substitutive, to human expertise. Local validation and context-specific calibration are vital prerequisites before widespread clinical adoption. Further development and external validation of AI models are needed to ensure equitable, accurate, and safe integration into maternal healthcare systems.\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e,\u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e,\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e,\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eThis study has several limitations, including the use of retrospective de-identified data, the absence of obstetric-specific training for the evaluated LLMs, and a modest single-center sample size that may limit generalizability. 
Moreover, reliance on textual prompts without multimodal inputs such as laboratory or imaging data may underestimate AI diagnostic potential. Future multicenter, prospective studies with locally adapted model training are needed to improve external validity and clinical applicability.\u003c/p\u003e"},{"header":"CONCLUSION","content":"\u003cp\u003eThis study found that ChatGPT demonstrated moderate agreement with clinicians (Kappa\u0026thinsp;=\u0026thinsp;0.593, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and consistent alignment with the clinical diagnosis of severe preeclampsia, highlighting its potential as a Clinical Decision Support System (CDSS). In contrast, DeepSeek and Gemini showed low to negative agreement. These findings indicate that further refinement and domain-specific optimization are needed before their clinical application.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cdiv class=\"DefinitionList\"\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eACOG\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eAmerican College of Obstetricians and Gynecologists\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eAI\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eArtificial Intelligence\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eLLM\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eLarge Language Model\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv class=\"DefinitionListEntry\"\u003e \u003cdiv class=\"Term\"\u003eRSUD\u003c/div\u003e \u003cdiv class=\"Description\"\u003e \u003cp\u003eRumah Sakit Umum Daerah (Regional General Hospital)\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003c/div\u003e"},{"header":"Declarations","content":" \u003cp\u003e \u003cstrong\u003eEthics approval and consent to 
participate\u003c/strong\u003e \u003cp\u003e The study was approved by the Health Research Ethics Committee of RSUD Al Ihsan (Reference No. 9892/70/KEPK-RSUD.Al.Ihsan/IV/2025). Since this study used de-identified retrospective clinical data and involved no direct patient intervention, the requirement for informed consent was waived by the ethics committee. All methods were performed in accordance with the Declaration of Helsinki.\u003c/p\u003e \u003cp\u003e \u003cstrong\u003eConsent for publication\u003c/strong\u003e \u003cp\u003eNot applicable.\u003c/p\u003e \u003ch2\u003eCompeting interests\u003c/h2\u003e \u003cp\u003eThe authors declare that they have no competing interests.\u003c/p\u003e \u003c/p\u003e\u003ch2\u003eFunding\u003c/h2\u003e \u003cp\u003eThis research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.\u003c/p\u003e\u003ch2\u003eAuthor Contribution\u003c/h2\u003e\u003cp\u003eDASP led the study design, performed the primary data curation, conducted the formal analysis, and drafted the initial manuscript. FAF was responsible for the supervision of clinical case validation at RSUD Al Ihsan and provided the necessary institutional resources. AYP contributed to the development of the conceptual framework, provided clinical oversight, and performed the critical review and final editing of the paper. All authors have read and approved the final version of the manuscript for submission.\u003c/p\u003e\u003ch2\u003eAcknowledgements\u003c/h2\u003e \u003cp\u003eWe extend our deepest gratitude to the Department of Obstetrics and Gynecology, Faculty of Medicine, Universitas Padjadjaran and Dr. Hasan Sadikin General Hospital, as well as RSUD Al Ihsan Bandung, for supporting the execution of this study. 
We thank all clinicians and residents who contributed their expertise in the clinical evaluation of cases.\u003c/p\u003e\u003ch2\u003eData Availability\u003c/h2\u003e\u003cp\u003eThe datasets analyzed during the current study are not publicly available due to institutional patient privacy policies but are available from the corresponding author on reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eAmerican College of Obstetricians and Gynecologists. ACOG Practice Bulletin 222: Gestational Hypertension and Preeclampsia. Obstet Gynecol. 2020;135(6):e237\u0026ndash;60.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCunningham FG, Leveno KJ, Bloom SL, Dashe JS, Spong CY, Hoffman BL, et al., editors. Williams Obstetrics. 26th ed. New York: McGraw Hill Medical; 2022.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePoon LC, Shennan A, Hyett JA, Kapur A, Hadar E, Divakar H, et al. The International Federation of Gynecology and Obstetrics (FIGO) initiative on pre-eclampsia: A pragmatic guide for first-trimester screening and prevention. Int J Gynecol Obstet. 2019;145(Suppl 1):1\u0026ndash;33.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJim B, Karumanchi SA. Preeclampsia: Pathogenesis, prevention, and long-term complications. Semin Nephrol. 2017;37(4):386\u0026ndash;97.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStaff AC, Sibai BM, Cunningham FG. Prevention of preeclampsia and eclampsia. In: Chesley\u0026rsquo;s Hypertensive Disorders in Pregnancy. Elsevier; 2015. p. 253\u0026ndash;67.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eElawad T, Scott G, Bone JN, Lopez CE, Filippi V, et al. Risk factors for pre-eclampsia in clinical practice guidelines: Comparison with the evidence. BJOG. 
2024;131(1):46\u0026ndash;62.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHuppertz B. Placental origins of preeclampsia: challenging the current hypothesis. Hypertension. 2008;51(4):970\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eQuinn J, Bonaparte J, Kilty S. Postoperative management in the prevention of complications after septoplasty: a systematic review. Laryngoscope. 2013;123(6):1328\u0026ndash;33.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChubb J, Cowling P, Reed D. Speeding up to keep up: exploring the use of AI in the research process. AI Soc. 2022;37(4):1439\u0026ndash;57.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAlhur A. Redefining Healthcare With Artificial Intelligence (AI): The Contributions of ChatGPT, Gemini, and Co-pilot. Cureus. 2024;16(4):e57532.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHassani H, Silva ES. The role of ChatGPT in data science: how AI-assisted conversational interfaces are revolutionizing the field. Big Data Cogn Comput. 2023;7(2):62.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNoronha C, Shan S, et al. Evaluating large language models in clinical reasoning: performance and reliability. JAMIA Open. 2025;8(3):ooaf055.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eShan S, Zhang Y, et al. Comparative diagnostic accuracy of large language models versus clinicians. Front Digit Health. 2025;7:1480031.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAli Y, Thompson J, et al. Artificial intelligence for hypertensive disorders of pregnancy: a systematic review. Am J Hypertens. 2024;37(2):145\u0026ndash;59.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu X, Zhang P, et al. Electrocardiogram-based prediction of preeclampsia using AI: a multicenter validation study. BMC Pregnancy Childbirth. 
2024;24:1098.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChandra S, Huang Y, et al. Regional disparities and adaptation needs for AI in maternal health. Lancet Digit Health. 2024;6(9):e812\u0026ndash;20.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRahman A, Lee D, et al. Ethical challenges and governance frameworks for AI in clinical decision-making. NPJ Digit Med. 2024;7(1):41.\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"bmc-pregnancy-and-childbirth","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"prch","sideBox":"Learn more about [BMC Pregnancy and Childbirth](http://bmcpregnancychildbirth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/prch/default.aspx","title":"BMC Pregnancy and Childbirth","twitterHandle":"@BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"AI, Large Language Models, Pre-eclampsia, Diagnostic Agreement, Clinical Decision Support","lastPublishedDoi":"10.21203/rs.3.rs-9244478/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-9244478/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground\u003c/h2\u003e \u003cp\u003ePreeclampsia remains one of the most significant causes of maternal and perinatal morbidity and mortality. Multiorgan dysfunction in severe preeclampsia requires early and accurate assessment. Medical AI, and LLMs in particular, could improve diagnostic precision, but validation in obstetric emergencies is scarce.\u003c/p\u003e\u003ch2\u003eObjective\u003c/h2\u003e \u003cp\u003eTo assess the concordance between clinical judgment and three LLMs (ChatGPT, DeepSeek, and Gemini) in the diagnosis of severe preeclampsia using standardized clinical case data.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eA cross-sectional analytic study was performed on 133 de-identified suspected cases of severe preeclampsia. Each case was independently diagnosed by clinicians and later by the three LLMs. 
Agreement was estimated with Cohen\u0026rsquo;s Kappa, and differences in paired diagnostic proportions were evaluated with McNemar\u0026rsquo;s test.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e \u003cp\u003eChatGPT showed moderate agreement with clinicians (Kappa\u0026thinsp;=\u0026thinsp;0.593, p\u0026thinsp;\u0026lt;\u0026thinsp;0.001), indicating statistically significant diagnostic concordance. DeepSeek showed very low agreement (Kappa\u0026thinsp;=\u0026thinsp;0.178, p\u0026thinsp;=\u0026thinsp;0.037), while Gemini demonstrated negative agreement (Kappa = \u0026minus;\u0026thinsp;0.240, p\u0026thinsp;=\u0026thinsp;0.006), suggesting systematic disagreement. According to McNemar\u0026rsquo;s test, there were no statistically significant differences in diagnosis between clinicians and any of the LLMs (ChatGPT p\u0026thinsp;=\u0026thinsp;0.122; DeepSeek p\u0026thinsp;=\u0026thinsp;0.105; Gemini p\u0026thinsp;=\u0026thinsp;0.824), indicating similar overall diagnostic proportions despite differences in case-level agreement. ChatGPT also had the highest sensitivity (76.6%) and specificity (83.9%).\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e \u003cp\u003eOf all the evaluated LLMs, only ChatGPT consistently diagnosed severe preeclampsia in alignment with clinical evaluations, supporting its potential use as a clinical decision support system. 
The lack of diagnostic concordance shown by DeepSeek and Gemini indicates that these models require further refinement and validation before obstetric use.\u003c/p\u003e","manuscriptTitle":"Diagnostic Accuracy of Large Language Models Versus Clinicians in Severe Preeclampsia: A Cross-Sectional Study","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-04-09 14:57:17","doi":"10.21203/rs.3.rs-9244478/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewersInvited","content":"","date":"2026-04-23T13:17:19+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2026-04-17T20:32:59+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2026-03-28T00:21:45+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2026-03-28T00:21:06+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Pregnancy and Childbirth","date":"2026-03-27T11:40:05+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"bmc-pregnancy-and-childbirth","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"prch","sideBox":"Learn more about [BMC Pregnancy and Childbirth](http://bmcpregnancychildbirth.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/prch/default.aspx","title":"BMC Pregnancy and Childbirth","twitterHandle":"@BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"caf52c59-9034-4859-8b07-d9fa5e20f41f","owner":[],"postedDate":"April 9th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2026-04-23T13:24:00+00:00","versionOfRecord":[],"versionCreatedAt":"2026-04-09 14:57:17","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-9244478","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-9244478","identity":"rs-9244478","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}