{"paper_id":"b3eee059-988d-4e72-b0d7-b1b09d321e35","body_text":"Who Fails Where? LLM and Human Error Patterns in\nEndometriosis Ultrasound Report Extraction\nHaiyi Li\na1949007@adelaide.edu.au\nUniv. of Adelaide\nAdelaide, Australia\nYutong Li\na1948101@adelaide.edu.au\nUniv. of Adelaide\nAdelaide, Australia\nYiheng Chi\nyiheng.chi@student.adelaide.edu.au\nUniv. of Adelaide\nAdelaide, Australia\nAlison Deslandes\nalison.deslandes@adelaide.edu.au\nRobinson Inst., Univ. of Adelaide\nAdelaide, Australia\nMathew Leonardi\nleonam@mcmaster.ca\nMcMaster University\nHamilton, Canada\nShay Freger\nFregers@mcmaster.ca\nMcMaster University\nHamilton, Canada\nYuan Zhang\nyuan.zhang01@adelaide.edu.au\nRobinson Inst., Univ. of Adelaide\nAdelaide, Australia\nJodie Avery\njodie.avery@adelaide.edu.au\nRobinson Inst., Univ. of Adelaide\nAdelaide, Australia\nMary Louise Hull\nlouise.hull@adelaide.edu.au\nRobinson Inst., Univ. of Adelaide\nAdelaide, Australia\nHsiang-Ting Chen\ntim.chen@adelaide.edu.au\nUniv. of Adelaide\nAdelaide, SA, Australia\nAbstract\nIn this study, we evaluate locally deployed large language mod-\nels (LLMs) for converting unstructured endometriosis transvaginal\nultrasound (eTVUS) reports into structured data. Across 49 de-\nidentified reports, we compared three on-premise LLMs (7B/8B\nand 20B parameters) against expert human extraction using a 185-\nfield schema. The 20B model achieved the highest mean accuracy\n(86.02%), substantially outperforming the smaller models. Crucially,\nLLMs and humans exhibited complementary error patterns: the\nLLM excelled on structured fields (date formatting, measurement\ndecomposition) where humans made protocol errors, while humans\ndemonstrated superior performance on interpretive fields involving\nnegation and clinical terminology. Targeted prompt engineering\nyielded only marginal gains, indicating that these errors reflect\nmodel limitations rather than instruction gaps. 
These findings sup-\nport a human-in-the-loop workflow in which the LLM generates\nstructured drafts, automated validation flags rule-verifiable errors,\nand human review focuses on fields requiring clinical interpreta-\ntion.\nCCS Concepts\n• Computing methodologies → Artificial intelligence; • Human-\ncentered computing → Human computer interaction (HCI);\n• Applied computing → Life and medical sciences.\nKeywords\ninformation extraction; large language models; human-in-the-loop;\nmedical reporting\n1 Introduction\nFree-text ultrasound reports contain clinically valuable informa-\ntion, but key variables are embedded in heterogeneous narrative\nstyles and local formatting conventions, limiting their use in an-\nalytics, model training, and auditing [14, 16, 18]. Across clinical\ndomains, this structural barrier complicates secondary use and ne-\ncessitates substantial manual abstraction [8, 9, 13, 16]. In settings\nwhere privacy requirements preclude cloud-based processing, ex-\ntraction remains a manual, safety-critical task: abstractors must\ninterpret clinical content while enforcing protocol constraints such\nas field decomposition, formatting standards, and missingness con-\nventions [5, 18]. Our contextual inquiry identified recurring risk\npoints, including terminology variation, inconsistent report detail,\nand verification-heavy routines that induce fatigue and increase\nsilent transcription and field-alignment errors [5, 18]. These ob-\nservations suggest that effective support tools should prioritise\nreviewability and accountability over full automation, enabling\npractitioners to calibrate their reliance on algorithmic assistance\nwithin real workflows [2, 7, 10].\nLocally deployable LLMs offer a practical means of scaling ab-\nstraction without transmitting sensitive reports to external services\n[3, 12]. 
Recent studies demonstrate that LLMs can perform few-shot\nclinical extraction with substantial medical knowledge [1, 15], yet\nthey also produce well-formed outputs that are semantically incor-\nrect, failures that are difficult to detect from the output alone [5, 17].\nIn structured reporting, this problem is acute: schema-compliant\nresponses may still mishandle negation, map terms to incorrect\ncategories, or misinterpret context-dependent findings. Without\nvisible indicators of error, users struggle to calibrate trust, leading\nto both over-reliance and under-reliance on model outputs [4, 6].\narXiv:2601.09053v2  [cs.HC]  26 Jan 2026\n\nLi et al.\nResearch on clinical AI deployments reinforces this concern, show-\ning that operational success depends less on standalone accuracy\nthan on workflow integration and mechanisms that direct reviewer\nattention to high-risk outputs [2, 7]. Designing such mechanisms\nrequires understanding where LLMs and humans each fail.\nThis study addresses that gap by investigating the error patterns\nof local LLMs and human abstractors to inform design guidelines for\nhuman-AI collaborative abstraction systems. Using 49 de-identified\neTVUS reports and a 185-field extraction schema, we benchmark\nthree on-premise models (7B to 20B parameters) against expert-\nverified human abstraction. Our analysis targets two dimensions\nwith direct design implications: report-level variance, which gov-\nerns review effort and suggests where batch processing is viable,\nand field-type error patterns, which reveal where human judgment\nremains essential and should be preserved. We find that LLMs\nand humans fail in complementary ways, motivating a division of\nlabour in which automated validation catches mechanical errors\nand risk-based triage routes semantically sensitive fields to human\nreview. 
We also test whether targeted prompting can reduce errors\non critical fields; the limited and inconsistent gains suggest that\nworkflow-level safeguards, rather than prompt refinement, offer\nthe more reliable mechanism for managing extraction risk.\n2 Experiment\n2.1 Data and Schema\nThe dataset consisted of unstructured, de-identified sonologists'\nreports obtained from a specialized gynecology and obstetrics ultra-\nsound clinic in Canada. These reports were heterogeneous, contain-\ning both structured data fields and free-text ultrasound narratives.\nTo prepare the data for the pipeline, each report, originally in PDF\nformat, was converted to plain text. We used a layout-preserving\nextraction process designed to retain semantic content while re-\nmoving extraneous metadata and formatting artifacts.\nThe extraction target was defined by a structured endometrio-\nsis-centric schema. This schema was created by programmatically\ntransforming the header row of a reference Excel data dictionary\ninto a concise JSON schema. This file defined all key fields, their data\ntypes, and output format constraints, serving as the ground truth\nstructure for both model prompting and final evaluation [1, 13].\nThe reference Excel file contained 185 fields in total. Each field\nwas programmatically assigned a data type based on its values and\nintended use. The schema's major data types were Numeric\n(6 fields), Date (2 fields), Text (19 fields), and Categorical (157 fields).\nThe majority of fields were categorical, typically representing con-\ntrolled vocabularies or discrete clinical options, while a smaller\nsubset were free-text or numeric entries. This distribution reflected\nthe highly structured nature of the target schema and the clinical\nemphasis on standardized reporting.\n2.2 Extraction Pipeline\nWe designed an on-premise extraction pipeline to ensure full patient\nprivacy and data sovereignty. 
The entire system operated offline,\nwithout reliance on external APIs, and was deployed on commodity\nhardware. The pipeline was built on the OLLAMA platform, using\nthree different LLMs: gpt-oss:20b, llama3-8b, and mistral-7b.\nThe workflow proceeded as follows:\n(1) Schema-Guided Prompting: Each plain-text report was\nprocessed sequentially in batch mode. For every report, the\nmodel received the full textual content along with the JSON\nschema embedded within the prompt as an instructional\ntemplate.\n(2) Inference: The LLM generated a structured JSON object\ncontaining the extracted field-specific values. This output\nwas saved as an intermediate file for validation.\n(3) Validation and Post-processing: A rule-based validation\nlayer was applied to the JSON outputs. This step normalized\nmissing value indicators (e.g., 0, NA, and empty strings),\nharmonized categorical variables to a controlled vocabulary,\nand verified field completeness and data-type conformity\n[11].\nThe experiment ran on a personal computer equipped with an\nNVIDIA RTX 3090 GPU (24 GB VRAM).\nFigure 1: Overview of the on-premise workflow. De-identified\nultrasound PDFs and the organization’s spreadsheet tem-\nplate are processed locally by a layout-aware LLM backend\nto produce a schema-aligned structured JSON draft. A field\nstratification strategy and rule-based validation prioritize se-\nmantically sensitive fields for review. An interactive human-\nin-the-loop interface supports rapid trace-back from each\nfield to relevant PDF text anchors, enabling verification and\ncorrection before exporting a verified structured dataset for\nclinical research use.\n3 Preliminary Evaluation\nScoring and accuracy definition. We report field-level accuracy\nscored against an expert-verified reference (Verified Truth). 
For each\nreport, we score each of the 185 schema fields as correct (1) if the\nnormalized prediction matches the normalized reference value, and\nincorrect (0) otherwise; report-level accuracy is the mean across\nfields, and we report the mean and standard deviation (SD) across 49\nreports. For numeric/date fields, we convert outputs into a canon-\nical representation and compare for equality in that canonical form.\nFor text/categorical fields, we apply lightweight normalization\n(case-folding plus whitespace and punctuation normalization) and\nthen require an exact match. Missing information is represented by\nan explicit token NOT_MENTION; a field is counted as correct when\nboth prediction and reference indicate NOT_MENTION, and incorrect\nwhen one indicates missingness while the other provides a value.\nFor protocol fields where 0 and NA both denote not detected / not\nrecorded, we treat 0 and NA as the same missingness state during\nscoring. We did not perform significance testing; comparisons be-\nlow are descriptive and intended to characterize performance level\nand variability in this setting.\nTable 1: Aggregate LLM Backbone Comparison. Overall mean\naccuracy and per-report standard deviation (Std) on the Sugo\ndataset, benchmarked against the verified ground truth.\nModel Backbone | Mean Accuracy (%) | Std (%)\ngpt-oss:20b | 86.02 | 6.87\nllama3-8b | 80.53 | 4.58\nmistral-7b | 78.89 | 4.68\nClinical RA | 98.40 | 2.13\nOur quantitative results are summarized in Table 1. Using the\nsame expert-sonographer-annotated and double-checked refer-\nence labels (Verified Truth), we scored and compared three locally-\ndeployed LLMs and a Clinical Research Assistant (Clinical RA)\nperforming manual abstraction. The Clinical RA achieved a mean\naccuracy of 98.40% with low variability (SD 2.13%). 
Among the\nthree LLMs, gpt-oss:20b achieved the highest mean accuracy in\nour dataset (86.02%), while llama3-8b and mistral-7b achieved\nmean accuracies of 80.53% and 78.89%, respectively. Notably,\ngpt-oss:20b also showed the largest per-report variability (SD 6.87%),\nindicating less consistent performance across reports.\nFigure 2: Report-level accuracy distributions for three locally-\ndeployed LLMs. gpt-oss:20b attains a higher median accu-\nracy but exhibits larger variability, including occasional low-\naccuracy outliers.\nFigure 2 further illustrates these distributional differences. The\nbox-and-whisker plot shows that gpt-oss:20b attains a higher\nmedian accuracy but also a wider interquartile range (IQR), with\na small number of outliers where accuracy drops markedly (e.g.,\nbelow 65%). Overall, this suggests that the larger model performs\nbetter on average in our dataset but is more sensitive to a subset\nof challenging reports, whereas smaller models exhibit a narrower\n(but lower) performance range.\n3.1 Error Analysis by Field Type\nTo investigate the sources of divergence, we stratified performance\nby field data type (Date, Numeric, Categorical, Text), as shown in\nFigure 3. Unless noted otherwise, the LLM results in this subsection\nfocus on gpt-oss:20b as the best-performing local backbone in\nour study, to highlight its typical error distribution.\nThe analysis reveals complementary failure modes between the\nLLM and the human extractor. The LLM performs best on more\nstructured, protocol-constrained fields, achieving its highest ac-\ncuracy on Date fields (97.3%) and Numeric fields (92.7%). Re-\nmaining errors in these categories are primarily omissions and\nschema-alignment failures (e.g., incomplete decomposition of multi-\ndimensional measurements). 
In contrast, errors are more concen-\ntrated in semantically sensitive Text and Categorical fields, where\nthe model often fails through omissions or inconsistent terminol-\nogy/ontology mapping [14, 16, 17].\nFigure 3: Performance breakdown by field type. In our setting,\nthe LLM performs best on structured fields (Date, Numeric)\nand worse on semantically nuanced fields (Text, Categorical).\nIn contrast, the human extractor’s errors were rarely clinical\nmisinterpretations, but instead were predominantly data-entry pro-\ntocol failures. A common error involved correctly reading a 3D\nnodule measurement from the report but failing to split it across\nthree separate required database fields.\n3.2 Error Analysis on Clinically Important\nFields\nTo examine whether prompt engineering could improve perfor-\nmance on key items, we conducted a follow-up experiment using\na critical-field prompt for gpt-oss:20b. This prompt explicitly\nidentified the seven most clinically critical fields, with the goal of\nimproving extraction consistency for these items. The critical-field\nprompt achieved marginally higher mean accuracy (88.8%, SD =\n6.23) compared to the generic prompt (87.0%, SD = 6.44). The critical-\nfield prompt produced only a small change in mean accuracy on\nthese fields, and the effect was not stable relative to report-level\nvariability. 
We therefore treat this intervention as limited in utility\nfor this schema-constrained task: emphasizing importance alone\ndoes not reliably change model behavior, and more practical safe-\nguards are likely to come from auditable workflow mechanisms\n(e.g., rule-based validation, risk-prioritized review, and structured\nerror logging) [1, 5].\n4 Discussion\n4.1 Complementary Failure Modes Enable Task\nAllocation\nOur results reveal that LLMs and human abstractors fail on different\nfield types, establishing a basis for differentiated task allocation.\nThe LLM achieved near-human accuracy on Date (97.3%) and Nu-\nmeric (92.7%) fields, where errors were predominantly mechani-\ncal (incomplete measurement decomposition, format mismatches)\nand detectable through rule-based validation. In contrast, the LLM\nstruggled with Categorical and Text fields (77.3% and 80.0%), which\nrequire interpreting negation, mapping synonymous terms to con-\ntrolled vocabularies, and inferring clinical intent. These semantic\nerrors are syntactically well-formed, confirming that schema com-\npliance alone cannot ensure extraction quality.\nThis complementarity suggests that uniform human oversight is\nneither necessary nor efficient. Fields with high LLM reliability and\nrule-detectable failures can be auto-validated with exception-based\nreview, while semantically sensitive fields should be flagged for\nmandatory human verification. Implementing this workflow re-\nquires interfaces that link each extracted value to its source text, en-\nabling reviewers to verify semantic correctness without re-reading\nentire reports.\n4.2 Extraction Variance as a Deployment\nCriterion\nMean accuracy alone obscures operational risk. The 20B model\nachieved the highest average accuracy (86.02%) but also the widest\nvariance (SD = 6.87%), with outliers below 65%. 
Inspection of these\ncases revealed two patterns: reports with atypical formatting trig-\ngered cascading failures, and reports dense with negated or condi-\ntional findings led to accumulated errors across categorical fields.\nSmaller models showed lower variance but at a lower accuracy\nlevel, suggesting more consistent but conservative outputs.\nHigh variance translates to unpredictable review burden. A re-\nviewer calibrated for occasional errors may overlook reports where\na third of extractions fail. This observation supports mechanisms\nsuch as confidence-based triage, routing structurally atypical or\nlow-confidence reports to full review. More broadly, model selec-\ntion should be framed as a variance-accuracy trade-off: in some\noperational contexts, a smaller model with predictable, recover-\nable errors may prove more practical than a larger model whose\nsporadic failures are harder to detect.\n5 Limitation\nOur evaluation is bounded by a modest sample (49 reports) from\na single institution, which may not capture variation in reporting\nstyles across sites. Data-sovereignty constraints restricted evalu-\nation to locally deployable models (7B–20B parameters); cloud-\nhosted or domain-specialized medical backbones (e.g., MedGemma)\nmay exhibit different error patterns and warrant future compari-\nson. Finally, our critical-field prompting experiment yielded null\nresults, suggesting that for schema-constrained tasks where er-\nrors are semantic rather than attentional, prompt emphasis alone\nis insufficient—motivating workflow-level safeguards rather than\ninstruction tuning as the more practical mitigation.\n6 Conclusion\nWe present a systematic evaluation of locally deployed LLMs for\nstructured extraction from endometriosis transvaginal ultrasound\nreports, addressing settings where data-sovereignty requirements\npreclude cloud-based processing. Three findings inform the design\nof human-AI abstraction workflows. 
First, the 20B model achieved\nthe highest field-level accuracy (86.02%) while remaining feasible\nfor on-premise deployment, demonstrating that capable local mod-\nels can support clinical extraction tasks. Second, LLMs and human\nabstractors exhibit complementary error patterns: LLMs excel on\nstructured fields while humans outperform on interpretive fields,\nsuggesting a division of labour rather than full automation or uni-\nform oversight. Third, targeted prompt engineering yielded no\nmeaningful improvement, indicating that prompt refinement alone\ncannot resolve errors on interpretive fields and that workflow-level\nsafeguards are necessary. These findings support a human-in-the-\nloop workflow in which local LLMs generate structured drafts\nat scale, automated validation flags mechanical errors, and tar-\ngeted human review addresses interpretive fields requiring clinical\njudgement. Future work should validate these patterns on multi-\nsite datasets and develop lightweight, auditable mechanisms for\nconfidence-based triage.\nReferences\n[1] Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag.\n2022. Large language models are few-shot clinical information extractors.arXiv\npreprint(2022). https://arxiv.org/abs/2205.12689 arXiv:2205.12689.\n[2] Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko, Lauren Wilcox,\nPaisan Ruamviboonsuk, and Laura M. Vardoulakis. 2020. A Human-Centered\nEvaluation of a Deep Learning System Deployed in Clinics for the Detection\nof Diabetic Retinopathy. InProceedings of the 2020 CHI Conference on Human\nFactors in Computing Systems(Honolulu, HI, USA)(CHI ’20). Association for\nComputing Machinery, New York, NY, USA, 1–12. doi:10.1145/3313831.3376718\n[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,\nPrafulla Dhariwal, et al . 2020. Language Models are Few-Shot Learners. In\nAdvances in Neural Information Processing Systems, H. Larochelle, M. Ran-\nzato, R. Hadsell, M. F. 
Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates,\nInc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/\n1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf\n[4] Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. To Trust\nor to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in\nAI-assisted Decision-making. 5, CSCW1, Article 188 (April 2021), 21 pages.\ndoi:10.1145/3449287\n[5] Felix Busch, Lena Hoffmann, Daniel Pinto Dos Santos, Marcus R Makowski,\nLuca Saba, Philipp Prucker, et al. 2025. Large language models for structured\nreporting in radiology: past, present, and future.European Radiology35, 5 (2025),\n2589–2602.\n[6] Adrian Bussone, Simone Stumpf, and Dympna O’Sullivan. 2015. The Role of\nExplanations on Trust and Reliance in Clinical Decision Support Systems. In\n2015 International Conference on Healthcare Informatics. 160–169. doi:10.1109/\nICHI.2015.26\n[7] Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry.\n2019. \"Hello AI\": Uncovering the Onboarding Needs of Medical Practitioners for\nHuman-AI Collaborative Decision-Making.Proc. ACM Hum.-Comput. Interact.3,\nCSCW, Article 104 (Nov. 2019), 24 pages. doi:10.1145/3359206\n[8] Sergio M Castro, Eugene Tseytlin, Olga Medvedeva, Kevin Mitchell, Shyam\nVisweswaran, Tanja Bekhuis, and Rebecca S Jacobson. 2017. Automated annota-\ntion and classification of BI-RADS assessment from radiology reports.Journal\nof Biomedical Informatics69 (2017), 177–187.\n[9] Mary F Davis, Subramaniam Sriram, William S Bush, Joshua C Denny, and\nJonathan L Haines. 2013. Automated extraction of clinical traits of multiple scle-\nrosis in electronic medical records.Journal of the American Medical Informatics\nAssociation20, e2 (2013), e334–e340.\n[10] Geraldine Fitzpatrick and Gunnar Ellingsen. 2013. A Review of 25 Years of CSCW\nResearch in Healthcare: Contributions, Challenges and Future Agendas.Comput.\nSupported Coop. Work22, 4–6 (Aug. 
2013), 609–665. doi:10.1007/s10606-012-9168-0\n[11] Sami-Ramzi Leyh-Bannurah, Zhe Tian, Pierre I Karakiewicz, Ulrich Wolffgang,\nGuido Sauter, Margit Fisch, et al. 2018. Deep learning for natural language pro-\ncessing in urology: state-of-the-art automated extraction of detailed pathologic\nprostate cancer data from narratively written electronic health records. JCO\nClinical Cancer Informatics 2 (2018), 1–9.\n[12] Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang.\n2023. Chatdoctor: A medical chat model fine-tuned on a large language model\nmeta-ai (llama) using medical domain knowledge. Cureus 15, 6 (2023).\n[13] Guergana K Savova, Eugene Tseytlin, Sean Finan, Melissa Castine, Timothy\nMiller, Olga Medvedeva, et al. 2017. DeepPhe: a natural language processing\nsystem for extracting cancer phenotypes from clinical records. Cancer Research\n77, 21 (2017), e115–e118.\n[14] Seyedmostafa Sheikhalishahi, Riccardo Miotto, Joel T Dudley, Alberto Lavelli,\nFabio Rinaldi, and Venet Osmani. 2019. Natural Language Processing of Clinical\nNotes on Chronic Diseases: Systematic Review. JMIR Medical Informatics 7, 2\n(27 Apr 2019), e12239. doi:10.2196/12239 PubMed: 31066697. Also available at:\nhttp://medinform.jmir.org/2019/2/e12239/.\n[15] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won\nChung, et al. 2023. Large language models encode clinical knowledge. Nature\n620, 7972 (2023), 172–180.\n[16] Yanshan Wang, Liwei Wang, Majid Rastegar-Mojarad, Sungrim Moon, Feichen\nShen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, and\nHongfang Liu. 2018. Clinical information extraction applications: A literature\nreview. Journal of Biomedical Informatics 77 (2018), 34–49. doi:10.1016/j.jbi.2017.11.011\n[17] Yuqing Wang, Yun Zhao, and Linda Petzold. 2023. Are large language models\nready for healthcare? 
a comparative study on clinical language understanding.\nIn Machine Learning for Healthcare Conference. PMLR, 804–823.\n[18] David L. Weiss and Curtis P. Langlotz. 2008. Structured Reporting: Patient Care\nEnhancement or Productivity Nightmare? Radiology 249, 3 (2008), 739–747.\ndoi:10.1148/radiol.2493080988 PMID: 19011178.","source_license":"CC0","license_restricted":false}