Evaluating local large language models for structured extraction from endometriosis-specific transvaginal ultrasound reports

Haiyi Li; Yutong Li; Yiheng Chi; Alison Deslandes; Mathew Leonardi; Shay M. Freger; Yuan Zhang; Jodie Avery; M. Louise Hull; Hsiang-Ting Chen

2026 · W7124358020

article OA: green CC0

⚙ AI-generated summary by claude@2026-06, 2026-06-07 ⓘ

A large language model was evaluated for extracting structured data from endometriosis ultrasound reports, achieving 86% accuracy and showing complementary strengths to human experts, supporting a human-in-the-loop workflow.

One-sentence paraphrase of the abstract; not a substitute for reading it. No clinical advice.

⚙ AI-generated deep summary by claude@2026-06, 2026-06-07 · read from full text ⓘ

This paper evaluates locally deployed large language models for converting unstructured endometriosis-specific transvaginal ultrasound (eTVUS) reports into a 185-field structured JSON schema, benchmarking three on-premise LLMs (7B/8B and 20B) against expert human extraction using 49 de-identified Canadian clinic reports. The gpt-oss:20b model achieved the highest mean field-level accuracy (86.02%) but with the largest report-to-report variability, while a Clinical Research Assistant achieved 98.40% with low variability; the authors also report complementary error patterns, with LLMs better on structured/protocol fields and humans better on interpretive fields involving negation and clinical terminology. Targeted prompt engineering produced only marginal gains, which the authors interpret as model limitations rather than instruction gaps, and they note no significance testing due to descriptive benchmarking goals. This paper is centrally about endometriosis — it focuses on LLM-based structured extraction from endometriosis ultrasound reports and analyzes LLM vs human error patterns for that purpose.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Abstract

In this study, we evaluate a locally-deployed large-language model (LLM) to convert unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming smaller models and confirming the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date/numeric formatting) where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors were fundamental limitations that could not be mitigated by simple prompt engineering. These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premise LLM serves as a collaborative tool, not a full replacement. It automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice.

Full text · extracted from oa-pdf

Abstract

In this study, we evaluate locally deployed large language models (LLMs) for converting unstructured endometriosis transvaginal ultrasound (eTVUS) reports into structured data. Across 49 de-identified reports, we compared three on-premise LLMs (7B/8B and 20B parameters) against expert human extraction using a 185-field schema. The 20B model achieved the highest mean accuracy (86.02%), substantially outperforming the smaller models. Crucially, LLMs and humans exhibited complementary error patterns: the LLM excelled on structured fields (date formatting, measurement decomposition) where humans made protocol errors, while humans demonstrated superior performance on interpretive fields involving negation and clinical terminology. Targeted prompt engineering yielded only marginal gains, indicating that these errors reflect model limitations rather than instruction gaps. These findings support a human-in-the-loop workflow in which the LLM generates structured drafts, automated validation flags rule-verifiable errors, and human review focuses on fields requiring clinical interpretation.

CCS Concepts

• Computing methodologies → Artificial intelligence; • Human-centered computing → Human computer interaction (HCI); • Applied computing → Life and medical sciences.

Keywords

information extraction; large language models; human-in-the-loop; medical reporting

1 Introduction

Free-text ultrasound reports contain clinically valuable information, but key variables are embedded in heterogeneous narrative styles and local formatting conventions, limiting their use in analytics, model training, and auditing [14, 16, 18]. Across clinical domains, this structural barrier complicates secondary use and necessitates substantial manual abstraction [8, 9, 13, 16]. In settings where privacy requirements preclude cloud-based processing, extraction remains a manual, safety-critical task: abstractors must interpret clinical content while enforcing protocol constraints such as field decomposition, formatting standards, and missingness conventions [5, 18]. Our contextual inquiry identified recurring risk points, including terminology variation, inconsistent report detail, and verification-heavy routines that induce fatigue and increase silent transcription and field-alignment errors [5, 18]. These observations suggest that effective support tools should prioritise reviewability and accountability over full automation, enabling practitioners to calibrate their reliance on algorithmic assistance within real workflows [2, 7, 10].

Locally deployable LLMs offer a practical means of scaling abstraction without transmitting sensitive reports to external services [3, 12]. Recent studies demonstrate that LLMs can perform few-shot clinical extraction with substantial medical knowledge [1, 15], yet they also produce well-formed outputs that are semantically incorrect, failures that are difficult to detect from the output alone [5, 17]. In structured reporting, this problem is acute: schema-compliant responses may still mishandle negation, map terms to incorrect categories, or misinterpret context-dependent findings.
Without visible indicators of error, users struggle to calibrate trust, leading to both over-reliance and under-reliance on model outputs [4, 6]. Research on clinical AI deployments reinforces this concern, showing that operational success depends less on standalone accuracy than on workflow integration and mechanisms that direct reviewer attention to high-risk outputs [2, 7]. Designing such mechanisms requires understanding where LLMs and humans each fail.

arXiv:2601.09053v2 [cs.HC] 26 Jan 2026 Li et al.

This study addresses that gap by investigating the error patterns of local LLMs and human abstractors to inform design guidelines for human-AI collaborative abstraction systems. Using 49 de-identified eTVUS reports and a 185-field extraction schema, we benchmark three on-premise models (7B to 20B parameters) against expert-verified human abstraction. Our analysis targets two dimensions with direct design implications: report-level variance, which governs review effort and suggests where batch processing is viable, and field-type error patterns, which reveal where human judgment remains essential and should be preserved. We find that LLMs and humans fail in complementary ways, motivating a division of labour in which automated validation catches mechanical errors and risk-based triage routes semantically sensitive fields to human review. We also test whether targeted prompting can reduce errors on critical fields; the limited and inconsistent gains suggest that workflow-level safeguards, rather than prompt refinement, offer the more reliable mechanism for managing extraction risk.

2 Experiment

2.1 Data and Schema

The dataset consisted of unstructured, de-identified sonologists' reports obtained from a specialized gynecology and obstetrics ultrasound clinic in Canada. These reports were heterogeneous, containing both structured data fields and free-text ultrasound narratives.
To prepare the data for the pipeline, each report, originally in PDF format, was converted to plain text. We used a layout-preserving extraction process designed to retain semantic content while removing extraneous metadata and formatting artifacts.

The extraction target was defined by a structured, endometriosis-centric schema. This schema was created by programmatically transforming the header row of a reference Excel data dictionary into a concise JSON schema. This file defined all key fields, their data types, and output format constraints, serving as the ground truth structure for both model prompting and final evaluation [1, 13].

The reference Excel file contained 185 fields in total. Each field was programmatically assigned a data type based on its values and intended use. The schema included five major data types: Numeric (6 fields), Date (2 fields), Text (19 fields), and Categorical (157 fields). The majority of fields were categorical, typically representing controlled vocabularies or discrete clinical options, while a smaller subset were free-text or numeric entries. This distribution reflected the highly structured nature of the target schema and the clinical emphasis on standardized reporting.

2.2 Extraction Pipeline

We designed an on-premise extraction pipeline to ensure full patient privacy and data sovereignty. The entire system operated offline, without reliance on external APIs, and was deployed on commodity hardware. The pipeline was built on the Ollama platform, using three LLMs: gpt-oss:20b, llama3-8b, and mistral-7b. The workflow proceeded as follows:

(1) Schema-Guided Prompting: Each plain-text report was processed sequentially in batch mode. For every report, the model received the full textual content along with the JSON schema embedded within the prompt as an instructional template.
(2) Inference: The LLM generated a structured JSON object containing the extracted field-specific values.
This output was saved as an intermediate file for validation.
(3) Validation and Post-processing: A rule-based validation layer was applied to the JSON outputs. This step normalized missing value indicators (e.g., 0, NA, and empty strings), harmonized categorical variables to a controlled vocabulary, and verified field completeness and data-type conformity [11].

The experiment ran on a personal computer equipped with an NVIDIA RTX 3090 GPU (24 GB VRAM).

Figure 1: Overview of the on-premise workflow. De-identified ultrasound PDFs and the organization's spreadsheet template are processed locally by a layout-aware LLM backend to produce a schema-aligned structured JSON draft. A field stratification strategy and rule-based validation prioritize semantically sensitive fields for review. An interactive human-in-the-loop interface supports rapid trace-back from each field to relevant PDF text anchors, enabling verification and correction before exporting a verified structured dataset for clinical research use.

3 Preliminary Evaluation

Scoring and accuracy definition. We report field-level accuracy scored against an expert-verified reference (Verified Truth). For each report, we score each of the 185 schema fields as correct (1) if the normalized prediction matches the normalized reference value, and incorrect (0) otherwise; report-level accuracy is the mean across fields, and we report the mean and standard deviation (SD) across 49 reports. For numeric/date fields, we convert outputs into a canonical representation and compare for equality in that canonical form. For text/categorical fields, we apply lightweight normalization (case-folding plus whitespace and punctuation normalization) and then require an exact match. Missing information is represented by an explicit token NOT_MENTION; a field is counted as correct when both prediction and reference indicate NOT_MENTION, and incorrect when one indicates missingness while the other provides a value.
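The scoring rules in this section (canonical forms for numeric/date fields, light normalization for text/categorical fields, the NOT_MENTION token, and the 0/NA equivalence for protocol fields) could be sketched as follows. This is a minimal illustration, not the authors' evaluation code: the function names, the `zero_is_missing` flag, the accepted date formats, the field names, and the example values are all assumptions made up for the sketch.

```python
import re
from datetime import datetime

MISSING = "NOT_MENTION"  # missingness token named in the paper


def normalize(value, field_type, zero_is_missing=False):
    """Map a raw value to a canonical form for exact-match comparison."""
    text = "" if value is None else str(value).strip()
    # Empty strings and NA always count as missing.
    if text == "" or text.upper() in {"NA", MISSING}:
        return MISSING
    if field_type == "numeric":
        try:
            number = float(text)
        except ValueError:
            return text.casefold()
        # Assumption: only flagged protocol fields treat 0 as missing.
        if zero_is_missing and number == 0:
            return MISSING
        return number
    if zero_is_missing and text == "0":
        return MISSING
    if field_type == "date":
        for fmt in ("%Y-%m-%d", "%d/%m/%Y"):  # assumed input formats
            try:
                return datetime.strptime(text, fmt).date().isoformat()
            except ValueError:
                continue
        return text.casefold()
    # Text / categorical: case-fold, drop punctuation, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text.casefold())
    return " ".join(text.split())


def report_accuracy(prediction, reference, schema):
    """Fraction of schema fields whose normalized values match exactly.

    schema maps field name -> (field_type, zero_is_missing).
    """
    correct = 0
    for field, (field_type, zero_is_missing) in schema.items():
        pred = normalize(prediction.get(field), field_type, zero_is_missing)
        ref = normalize(reference.get(field), field_type, zero_is_missing)
        correct += int(pred == ref)
    return correct / len(schema)


# Hypothetical four-field schema and made-up values, for illustration only.
schema = {
    "scan_date": ("date", False),
    "nodule_length_mm": ("numeric", False),
    "left_ovary_endometrioma": ("categorical", True),
    "comments": ("text", False),
}
reference = {
    "scan_date": "2024-03-05",
    "nodule_length_mm": "12",
    "left_ovary_endometrioma": "NA",
    "comments": "No deep endometriosis.",
}
prediction = {
    "scan_date": "05/03/2024",
    "nodule_length_mm": 12.0,
    "left_ovary_endometrioma": 0,
    "comments": "deep endometriosis",  # dropped negation: scored incorrect
}
accuracy = report_accuracy(prediction, reference, schema)
```

In this made-up example three of the four fields match after normalization; the text field differs only by a dropped negation, which is the kind of well-formed but semantically wrong output that exact-match scoring still catches.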
For protocol fields where 0 and NA both denote "not detected / not recorded", we treat 0 and NA as the same missingness state during scoring. We did not perform significance testing; comparisons below are descriptive and intended to characterize performance level and variability in this setting.

Who Fails Where? LLM and Human Error Patterns in Endometriosis Ultrasound Report Extraction

Table 1: Aggregate LLM Backbone Comparison. Overall mean accuracy and per-report standard deviation (Std) on the Sugo dataset, benchmarked against the verified ground truth.

| Model Backbone | Mean Accuracy (%) | Std (%) |
|---|---|---|
| gpt-oss:20b | 86.02 | 6.87 |
| llama3-8b | 80.53 | 4.58 |
| mistral-7b | 78.89 | 4.68 |
| Clinical RA | 98.40 | 2.13 |

Our quantitative results are summarized in Table 1. Using the same expert-sonographer-annotated, double-checked reference labels (Verified Truth), we scored and compared three locally deployed LLMs and a Clinical Research Assistant (Clinical RA) performing manual abstraction. The Clinical RA achieved a mean accuracy of 98.40% with low variability (SD 2.13%). Among the three LLMs, gpt-oss:20b achieved the highest mean accuracy in our dataset (86.02%), while llama3-8b and mistral-7b achieved mean accuracies of 80.53% and 78.89%, respectively. Notably, while gpt-oss:20b achieved the highest mean accuracy among the LLMs, it also showed the largest per-report variance (SD 6.87%), indicating less consistent performance across reports.

Figure 2: Report-level accuracy distributions for three locally deployed LLMs. gpt-oss:20b attains a higher median accuracy but exhibits larger variability, including occasional low-accuracy outliers.

Figure 2 further illustrates these distributional differences. The box-and-whisker plot shows that gpt-oss:20b attains a higher median accuracy but also a wider interquartile range (IQR), with a small number of outliers where accuracy drops markedly (e.g., below 65%).
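The report-level quantities discussed here (mean, SD, median, IQR, and low-accuracy outliers) can be computed from a list of per-report accuracies. The sketch below uses Python's standard `statistics` module; the function name `summarize`, the 0.65 cut-off (taken from the "below 65%" example in the text), and the five accuracy values are illustrative assumptions, not data from the study. It uses the sample standard deviation, since the paper does not state which SD variant was reported.

```python
import statistics


def summarize(report_accuracies, low_threshold=0.65):
    """Summarize per-report accuracies and list low-accuracy report indices."""
    q1, median, q3 = statistics.quantiles(report_accuracies, n=4)
    return {
        "mean": statistics.mean(report_accuracies),
        "sd": statistics.stdev(report_accuracies),  # sample SD (assumption)
        "median": median,
        "iqr": q3 - q1,
        # Indices of reports that would need closer review.
        "low_accuracy_reports": [
            i for i, acc in enumerate(report_accuracies) if acc < low_threshold
        ],
    }


# Made-up accuracies for five reports, for illustration only.
summary = summarize([0.90, 0.88, 0.60, 0.92, 0.85])
```

Reporting the list of low-accuracy report indices alongside the mean makes the review burden visible, which a single average hides.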
Overall, this suggests that the larger model performs better on average in our dataset but is more sensitive to a subset of challenging reports, whereas smaller models exhibit a narrower (but lower) performance range.

3.1 Error Analysis by Field Type

To investigate the sources of divergence, we stratified performance by field data type (Date, Numeric, Categorical, Text), as shown in Figure 3. Unless noted otherwise, the LLM results in this subsection focus on gpt-oss:20b as the best-performing local backbone in our study, to highlight its typical error distribution.

The analysis reveals complementary failure modes between the LLM and the human extractor. The LLM performs best on more structured, protocol-constrained fields, achieving its highest accuracy on Date fields (97.3%) and Numeric fields (92.7%). Remaining errors in these categories are primarily omissions and schema-alignment failures (e.g., incomplete decomposition of multi-dimensional measurements). In contrast, errors are more concentrated in semantically sensitive Text and Categorical fields, where the model often fails through omissions or inconsistent terminology/ontology mapping [14, 16, 17].

Figure 3: Performance breakdown by field type. In our setting, the LLM performs best on structured fields (Date, Numeric) and worse on semantically nuanced fields (Text, Categorical).

In contrast, the human extractor's errors were rarely clinical misinterpretations, but instead were predominantly data-entry protocol failures. A common error involved correctly reading a 3D nodule measurement from the report but failing to split it across three separate required database fields.

3.2 Error Analysis on Clinically Important Fields

To examine whether prompt engineering could improve performance on key items, we conducted a follow-up experiment using a critical-field prompt for gpt-oss:20b.
This prompt explicitly identified the seven most clinically critical fields, with the goal of improving extraction consistency for these items. The critical-field prompt achieved marginally higher mean accuracy (88.8%, SD = 6.23) compared to the generic prompt (87.0%, SD = 6.44). The change in mean accuracy on these fields was small (1.8 percentage points), and the effect was not stable relative to report-level variability. We therefore treat this intervention as limited in utility for this schema-constrained task: emphasizing importance alone does not reliably change model behavior, and more practical safeguards are likely to come from auditable workflow mechanisms (e.g., rule-based validation, risk-prioritized review, and structured error logging) [1, 5].

4 Discussion

4.1 Complementary Failure Modes Enable Task Allocation

Our results reveal that LLMs and human abstractors fail on different field types, establishing a basis for differentiated task allocation. The LLM achieved near-human accuracy on Date (97.3%) and Numeric (92.7%) fields, where errors were predominantly mechanical (incomplete measurement decomposition, format mismatches) and detectable through rule-based validation. In contrast, the LLM struggled with Categorical and Text fields (77.3% and 80.0%), which require interpreting negation, mapping synonymous terms to controlled vocabularies, and inferring clinical intent. These semantic errors are syntactically well-formed, confirming that schema compliance alone cannot ensure extraction quality.

This complementarity suggests that uniform human oversight is neither necessary nor efficient. Fields with high LLM reliability and rule-detectable failures can be auto-validated with exception-based review, while semantically sensitive fields should be flagged for mandatory human verification.
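The allocation described here can be expressed as a small routing rule. The sketch below is one possible reading under stated assumptions, not a system described in the paper: the `ExtractedField` class, the `route` function, the three queue names, and the example fields are hypothetical, and `passes_rules` stands in for the outcome of whatever rule-based validation is in place.

```python
from dataclasses import dataclass

# Field types where the paper reports high LLM accuracy and mechanical errors.
STRUCTURED_TYPES = {"date", "numeric"}


@dataclass
class ExtractedField:
    name: str
    field_type: str     # "date", "numeric", "categorical", or "text"
    value: str
    passes_rules: bool  # outcome of rule-based validation (assumed input)


def route(field: ExtractedField) -> str:
    """Decide which review queue an extracted field goes to."""
    if field.field_type in STRUCTURED_TYPES:
        # Rule-detectable failures become exceptions; the rest is accepted.
        return "auto_accept" if field.passes_rules else "exception_review"
    # Semantically sensitive fields always get a human check.
    return "mandatory_review"


# Made-up fields for illustration only.
fields = [
    ExtractedField("scan_date", "date", "2024-03-05", True),
    ExtractedField("nodule_length_mm", "numeric", "12x8x6", False),
    ExtractedField("pouch_of_douglas", "categorical", "obliterated", True),
]
queues = {f.name: route(f) for f in fields}
```

Note that a categorical value passing its format rules is still routed to a human, because the paper's semantic errors are well-formed and therefore invisible to rule checks.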
Implementing this workflow requires interfaces that link each extracted value to its source text, enabling reviewers to verify semantic correctness without re-reading entire reports.

4.2 Extraction Variance as a Deployment Criterion

Mean accuracy alone obscures operational risk. The 20B model achieved the highest average accuracy (86.02%) but also the widest variance (SD = 6.87%), with outliers below 65%. Inspection of these cases revealed two patterns: reports with atypical formatting triggered cascading failures, and reports dense with negated or conditional findings led to accumulated errors across categorical fields. Smaller models showed lower variance but at a lower accuracy level, suggesting more consistent but conservative outputs.

High variance translates to unpredictable review burden. A reviewer calibrated for occasional errors may overlook reports where a third of extractions fail. This observation supports mechanisms such as confidence-based triage, routing structurally atypical or low-confidence reports to full review. More broadly, model selection should be framed as a variance-accuracy trade-off: in some operational contexts, a smaller model with predictable, recoverable errors may prove more practical than a larger model whose sporadic failures are harder to detect.

5 Limitation

Our evaluation is bounded by a modest sample (49 reports) from a single institution, which may not capture variation in reporting styles across sites. Data-sovereignty constraints restricted evaluation to locally deployable models (7B–20B parameters); cloud-hosted or domain-specialized medical backbones (e.g., MedGemma) may exhibit different error patterns and warrant future comparison.
Finally, our critical-field prompting experiment yielded null results, suggesting that for schema-constrained tasks where errors are semantic rather than attentional, prompt emphasis alone is insufficient, motivating workflow-level safeguards rather than instruction tuning as the more practical mitigation.

6 Conclusion

We present a systematic evaluation of locally deployed LLMs for structured extraction from endometriosis transvaginal ultrasound reports, addressing settings where data-sovereignty requirements preclude cloud-based processing. Three findings inform the design of human-AI abstraction workflows. First, the 20B model achieved the highest field-level accuracy (86.02%) while remaining feasible for on-premise deployment, demonstrating that capable local models can support clinical extraction tasks. Second, LLMs and human abstractors exhibit complementary error patterns: LLMs excel on structured fields while humans outperform on interpretive fields, suggesting a division of labour rather than full automation or uniform oversight. Third, targeted prompt engineering yielded no meaningful improvement, indicating that prompt refinement alone cannot resolve errors on interpretive fields and that workflow-level safeguards are necessary. These findings support a human-in-the-loop workflow in which local LLMs generate structured drafts at scale, automated validation flags mechanical errors, and targeted human review addresses interpretive fields requiring clinical judgement. Future work should validate these patterns on multi-site datasets and develop lightweight, auditable mechanisms for confidence-based triage.

References

[1] Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag. 2022. Large language models are few-shot clinical information extractors. arXiv preprint (2022). https://arxiv.org/abs/2205.12689 arXiv:2205.12689.
[2] Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko, Lauren Wilcox, Paisan Ruamviboonsuk, and Laura M. Vardoulakis. 2020. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI '20). Association for Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/3313831.3376718
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[4] Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. To Trust or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in AI-assisted Decision-making. 5, CSCW1, Article 188 (April 2021), 21 pages. doi:10.1145/3449287
[5] Felix Busch, Lena Hoffmann, Daniel Pinto Dos Santos, Marcus R Makowski, Luca Saba, Philipp Prucker, et al. 2025. Large language models for structured reporting in radiology: past, present, and future. European Radiology 35, 5 (2025), 2589–2602.
[6] Adrian Bussone, Simone Stumpf, and Dympna O'Sullivan. 2015. The Role of Explanations on Trust and Reliance in Clinical Decision Support Systems. In 2015 International Conference on Healthcare Informatics. 160–169. doi:10.1109/ICHI.2015.26
[7] Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry. 2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for Human-AI Collaborative Decision-Making. Proc. ACM Hum.-Comput. Interact. 3, CSCW, Article 104 (Nov. 2019), 24 pages. doi:10.1145/3359206
[8] Sergio M Castro, Eugene Tseytlin, Olga Medvedeva, Kevin Mitchell, Shyam Visweswaran, Tanja Bekhuis, and Rebecca S Jacobson. 2017. Automated annotation and classification of BI-RADS assessment from radiology reports. Journal of Biomedical Informatics 69 (2017), 177–187.
[9] Mary F Davis, Subramaniam Sriram, William S Bush, Joshua C Denny, and Jonathan L Haines. 2013. Automated extraction of clinical traits of multiple sclerosis in electronic medical records. Journal of the American Medical Informatics Association 20, e2 (2013), e334–e340.
[10] Geraldine Fitzpatrick and Gunnar Ellingsen. 2013. A Review of 25 Years of CSCW Research in Healthcare: Contributions, Challenges and Future Agendas. Comput. Supported Coop. Work 22, 4–6 (Aug. 2013), 609–665. doi:10.1007/s10606-012-9168-0
[11] Sami-Ramzi Leyh-Bannurah, Zhe Tian, Pierre I Karakiewicz, Ulrich Wolffgang, Guido Sauter, Margit Fisch, et al. 2018. Deep learning for natural language processing in urology: state-of-the-art automated extraction of detailed pathologic prostate cancer data from narratively written electronic health records. JCO Clinical Cancer Informatics 2 (2018), 1–9.
[12] Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang. 2023. Chatdoctor: A medical chat model fine-tuned on a large language model meta-ai (llama) using medical domain knowledge. Cureus 15, 6 (2023).
[13] Guergana K Savova, Eugene Tseytlin, Sean Finan, Melissa Castine, Timothy Miller, Olga Medvedeva, et al. 2017. DeepPhe: a natural language processing system for extracting cancer phenotypes from clinical records. Cancer Research 77, 21 (2017), e115–e118.
[14] Seyedmostafa Sheikhalishahi, Riccardo Miotto, Joel T Dudley, Alberto Lavelli, Fabio Rinaldi, and Venet Osmani. 2019. Natural Language Processing of Clinical Notes on Chronic Diseases: Systematic Review. JMIR Medical Informatics 7, 2 (27 Apr 2019), e12239. doi:10.2196/12239 PubMed: 31066697. Also available at: http://medinform.jmir.org/2019/2/e12239/.
[15] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, et al. 2023. Large language models encode clinical knowledge. Nature 620, 7972 (2023), 172–180.
[16] Yanshan Wang, Liwei Wang, Majid Rastegar-Mojarad, Sungrim Moon, Feichen Shen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, and Hongfang Liu. 2018. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics 77 (2018), 34–49. doi:10.1016/j.jbi.2017.11.011
[17] Yuqing Wang, Yun Zhao, and Linda Petzold. 2023. Are large language models ready for healthcare? a comparative study on clinical language understanding. In Machine Learning for Healthcare Conference. PMLR, 804–823.
[18] David L. Weiss and Curtis P. Langlotz. 2008. Structured Reporting: Patient Care Enhancement or Productivity Nightmare? Radiology 249, 3 (2008), 739–747. doi:10.1148/radiol.2493080988 PMID: 19011178.

Condition tags

endometriosis

Source provenance

openalex: last seen: 2026-05-13T19:46:01.794608+00:00

License: CC0 · commercial use OK