Accuracy of a Large Language Model in Identifying Gynecologic Surgeons for Endometriosis, Prolapse, and Hysterectomy

K. T. O'Donnell; M. Billow

doi:10.1097/aog.0000000000006205.1

INTRODUCTION: Artificial intelligence (AI) tools are increasingly used by patients to guide health decisions and self-direct specialty care. Although health systems employ referral guardrails, patients’ initial reliance on AI can shape care-seeking and entry points into the system. Referral misdirection risks delayed diagnosis, unnecessary consultations, treatment delays, and reduced access to subspecialty surgical care.

OBJECTIVE: To evaluate the accuracy of a large language model (LLM) in identifying appropriate gynecologic surgeons for endometriosis, pelvic organ prolapse (POP), and hysterectomy for heavy bleeding when queried with patient-style prompts. Secondary aims included comparing performance by prompt type and characterizing error patterns, including hallucination, misclassification, geographic misdirection, and outdated affiliations.

METHODS: We evaluated LLM performance using publicly available data from multiple academic centers and community practices within a single U.S. metropolitan market. Twenty-one patient-style prompts were designed to reflect realistic queries and were systematically organized into four categories (symptom, diagnosis, treatment, navigation) and three combined forms (symptom+diagnosis, diagnosis+treatment, symptom+treatment). Example prompts included: “I have painful periods and want to know who treats endometriosis,” “I feel a bulge and want to know who treats prolapse,” and “Who does hysterectomy for heavy bleeding in my area?” Each prompt was run in triplicate with browsing ON and OFF, yielding 121 runs. Standardized guardrails instructed the model to return up to five named surgeons with institutional links, include a brief rationale for each recommendation, restrict responses to the study region, and avoid guessing. Gold-standard lists were created from institutional websites and professional society rosters. Outputs were validated for correctness, hallucination, subspecialty distribution, and reproducibility across replicates. As only publicly available data were used, the project met criteria for Not Human Subjects Research and did not require IRB review.

RESULTS: Across 121 runs (590 names), overall precision was 71.5% (95% CI 67.8–75.0) and recall@5 was 100%, with at least one correct surgeon identified in every query. Hallucinations occurred in 7.6% of outputs, most often as nonexistent providers. Precision was significantly higher for hysterectomy (87.0%, 95% CI 81.6–91.0) than for endometriosis (62.6%, 95% CI 55.4–69.3) and POP (64.4%, 95% CI 57.7–70.6) (p<0.001). Prompt-type analysis showed higher precision for symptom and treatment queries in endometriosis (78% and 74%) than for navigation queries (47%). POP prompts performed similarly across categories (60–70%). Hysterectomy prompts were consistently strong, with navigation and combined forms exceeding 90% precision. Among 168 false positives, subspecialty misclassification (40%) and hallucinations (27%) predominated, followed by wrong geography (16%) and outdated affiliations (13%). In hysterectomy queries (n=200 names), referrals skewed toward MIGS (41.5%, 95% CI 35.0–48.2) and Urogynecology (32.3%, 95% CI 26.4–39.0), followed by OB/GYNs (11.0%), REI (4.6%), and GYN/ONC (2.4%). Median latency was 0.2 seconds, and reproducibility across triplicates was moderate.

CONCLUSIONS: This is the first systematic evaluation of a large language model in gynecologic surgical referrals, demonstrating high recall but variable precision with frequent subspecialty misclassification and fabricated providers. AI presents both opportunities and risks for timely, accurate, and safe surgical care. Surgeon leadership is essential to guide integration of these tools into referral pathways and ensure equitable access to subspecialty surgical care.

Figure 1, Figure 2, Table 1
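The precision estimates in RESULTS can be checked with a short calculation. The sketch below is illustrative only: the abstract does not name its confidence-interval method, so the Wilson score interval is an assumption, as are the function name `wilson_interval` and the variable names. The counts are derived from the reported figures (590 names with 168 false positives overall; 87.0% of 200 names for hysterectomy).

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (assumed method, not stated in the abstract)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width


# Overall: 590 returned names, 168 of them false positives (from RESULTS).
total_names = 590
false_positives = 168
true_positives = total_names - false_positives  # 422 correct names

overall_precision = true_positives / total_names
overall_low, overall_high = wilson_interval(true_positives, total_names)

# Hysterectomy: 87.0% of n=200 names, i.e. 174 correct (count derived from the percentage).
hyst_correct, hyst_total = 174, 200
hyst_low, hyst_high = wilson_interval(hyst_correct, hyst_total)

print(f"overall precision = {overall_precision:.1%} "
      f"(95% CI {overall_low:.1%} to {overall_high:.1%})")
print(f"hysterectomy precision = {hyst_correct / hyst_total:.1%} "
      f"(95% CI {hyst_low:.1%} to {hyst_high:.1%})")
```

Under these assumptions the sketch reproduces the reported intervals (67.8–75.0 overall, 81.6–91.0 for hysterectomy), which suggests, but does not confirm, that a Wilson-type interval was used.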
