Abstract
In this study, we evaluate locally deployed large language mod-
els (LLMs) for converting unstructured endometriosis transvaginal
ultrasound (eTVUS) reports into structured data. Across 49 de-
identified reports, we compared three on-premise LLMs (7B/8B
and 20B parameters) against expert human extraction using a 185-
field schema. The 20B model achieved the highest mean accuracy
(86.02%), substantially outperforming the smaller models. Crucially,
LLMs and humans exhibited complementary error patterns: the
LLM excelled on structured fields (date formatting, measurement
decomposition) where humans made protocol errors, while humans
demonstrated superior performance on interpretive fields involving
negation and clinical terminology. Targeted prompt engineering
yielded only marginal gains, indicating that these errors reflect
model limitations rather than instruction gaps. These findings sup-
port a human-in-the-loop workflow in which the LLM generates
structured drafts, automated validation flags rule-verifiable errors,
and human review focuses on fields requiring clinical interpreta-
tion.
CCS Concepts
• Computing methodologies → Artificial intelligence; • Human-
centered computing → Human computer interaction (HCI);
• Applied computing → Life and medical sciences.
Keywords
information extraction; large language models; human-in-the-loop;
medical reporting
1 Introduction
Free-text ultrasound reports contain clinically valuable informa-
tion, but key variables are embedded in heterogeneous narrative
styles and local formatting conventions, limiting their use in analytics,
model training, and auditing [14, 16, 18]. Across clinical
domains, this structural barrier complicates secondary use and ne-
cessitates substantial manual abstraction [8, 9, 13, 16]. In settings
where privacy requirements preclude cloud-based processing, ex-
traction remains a manual, safety-critical task: abstractors must
interpret clinical content while enforcing protocol constraints such
as field decomposition, formatting standards, and missingness con-
ventions [5, 18]. Our contextual inquiry identified recurring risk
points, including terminology variation, inconsistent report detail,
and verification-heavy routines that induce fatigue and increase
silent transcription and field-alignment errors [5, 18]. These
observations suggest that effective support tools should prioritise
reviewability and accountability over full automation, enabling
practitioners to calibrate their reliance on algorithmic assistance
within real workflows [2, 7, 10].
Locally deployable LLMs offer a practical means of scaling ab-
straction without transmitting sensitive reports to external services
[3, 12]. Recent studies demonstrate that LLMs can perform few-shot
clinical extraction with substantial medical knowledge [1, 15], yet
they also produce well-formed outputs that are semantically incor-
rect, failures that are difficult to detect from the output alone [5, 17].
In structured reporting, this problem is acute: schema-compliant
responses may still mishandle negation, map terms to incorrect
categories, or misinterpret context-dependent findings. Without
visible indicators of error, users struggle to calibrate trust, leading
to both over-reliance and under-reliance on model outputs [4, 6].
arXiv:2601.09053v2 [cs.HC] 26 Jan 2026
Li et al.
Research on clinical AI deployments reinforces this concern, show-
ing that operational success depends less on standalone accuracy
than on workflow integration and mechanisms that direct reviewer
attention to high-risk outputs [2, 7]. Designing such mechanisms
requires understanding where LLMs and humans each fail.
This study addresses that gap by investigating the error patterns
of local LLMs and human abstractors to inform design guidelines for
human-AI collaborative abstraction systems. Using 49 de-identified
eTVUS reports and a 185-field extraction schema, we benchmark
three on-premise models (7B to 20B parameters) against expert-
verified human abstraction. Our analysis targets two dimensions
with direct design implications: report-level variance, which gov-
erns review effort and suggests where batch processing is viable,
and field-type error patterns, which reveal where human judgment
remains essential and should be preserved. We find that LLMs
and humans fail in complementary ways, motivating a division of
labour in which automated validation catches mechanical errors
and risk-based triage routes semantically sensitive fields to human
review. We also test whether targeted prompting can reduce errors
on critical fields; the limited and inconsistent gains suggest that
workflow-level safeguards, rather than prompt refinement, offer
the more reliable mechanism for managing extraction risk.
2 Experiment
2.1 Data and Schema
The dataset consisted of unstructured, de-identified sonologist
reports obtained from a specialized gynecology and obstetrics ultra-
sound clinic in Canada. These reports were heterogeneous, contain-
ing both structured data fields and free-text ultrasound narratives.
To prepare the data for the pipeline, each report, originally in PDF
format, was converted to plain text. We used a layout-preserving
extraction process designed to retain semantic content while re-
moving extraneous metadata and formatting artifacts.
The extraction target was defined by a structured, endometriosis-centric
schema. This schema was created by programmatically
transforming the header row of a reference Excel data dictionary
into a concise JSON schema. This file defined all key fields, their data
types, and output format constraints, serving as the ground truth
structure for both model prompting and final evaluation [1, 13].
The reference Excel file contained 185 fields in total. Each field
was programmatically assigned a data type based on its values and
intended use. The schema included four major data types: Numeric
(6 fields), Date (2 fields), Text (19 fields), and Categorical (157 fields).
The majority of fields were categorical, typically representing con-
trolled vocabularies or discrete clinical options, while a smaller
subset were free-text or numeric entries. This distribution reflected
the highly structured nature of the target schema and the clinical
emphasis on standardized reporting.
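The schema construction described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the type-inference heuristic, the date format, the helper names `infer_type` and `build_schema`, and the example column names are all assumptions, since the paper states only that types were assigned from each field's values and intended use.

```python
import json
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumed canonical date format, not stated in the paper


def _parses(value, parser):
    """Return True if parser(value) succeeds without a ValueError."""
    try:
        parser(value)
        return True
    except ValueError:
        return False


def infer_type(values, max_categories=10, max_label_len=30):
    """Guess a field type from example cell values (hypothetical heuristic)."""
    observed = [v.strip() for v in values if v and v.strip()]
    if not observed:
        return "Text"
    if all(_parses(v, lambda s: datetime.strptime(s, DATE_FORMAT)) for v in observed):
        return "Date"
    if all(_parses(v, float) for v in observed):
        return "Numeric"
    short = all(len(v) <= max_label_len for v in observed)
    if short and len(set(observed)) <= max_categories:
        return "Categorical"
    return "Text"


def build_schema(header, rows):
    """Map each column name to a type and, for categorical fields, its vocabulary."""
    schema = {}
    for index, name in enumerate(header):
        column = [row[index] for row in rows]
        entry = {"type": infer_type(column)}
        if entry["type"] == "Categorical":
            entry["allowed"] = sorted({v.strip() for v in column if v.strip()})
        schema[name] = entry
    return schema


# Made-up column names and values, for illustration only.
header = ["exam_date", "nodule_length_mm", "pouch_of_douglas", "comments"]
rows = [
    ["2024-01-15", "12.5", "obliterated",
     "Left ovary adherent to the uterus with restricted mobility"],
    ["2024-02-03", "8", "normal", ""],
]
schema = build_schema(header, rows)
print(json.dumps(schema, indent=2))
```

A value-based heuristic like this would misclassify coded categorical columns (e.g., 0/1 flags) as numeric, which is one reason a data dictionary would normally also record intended use.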
2.2 Extraction Pipeline
We designed an on-premise extraction pipeline to ensure full patient
privacy and data sovereignty. The entire system operated offline,
without reliance on external APIs, and was deployed on commodity
hardware. The pipeline was built on the Ollama platform, using
three LLMs: gpt-oss:20b, llama3-8b, and mistral-7b.
The workflow proceeded as follows:
(1) Schema-Guided Prompting: Each plain-text report was
processed sequentially in batch mode. For every report, the
model received the full textual content along with the JSON
schema embedded within the prompt as an instructional
template.
(2) Inference: The LLM generated a structured JSON object
containing the extracted field-specific values. This output
was saved as an intermediate file for validation.
(3) Validation and Post-processing: A rule-based validation
layer was applied to the JSON outputs. This step normalized
missing value indicators (e.g., 0, NA, and empty strings),
harmonized categorical variables to a controlled vocabulary,
and verified field completeness and data-type conformity
[11].
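The validation and post-processing step (3) could look something like the sketch below. It is an assumption-laden illustration rather than the pipeline's actual code: the schema layout, the `synonyms` map, the issue labels, and the example field names are hypothetical, and the set of missing-value spellings is a guess. The literal 0 is deliberately not treated as missing here because it can be a real measurement; the paper handles that case per protocol field.

```python
NOT_MENTION = "NOT_MENTION"            # missingness token named in the paper
MISSING_TOKENS = {"", "na", "n/a"}     # assumed spellings of "no value"


def normalize_missing(value):
    """Collapse None and missing-value spellings into one token."""
    if value is None:
        return NOT_MENTION
    if isinstance(value, str) and value.strip().lower() in MISSING_TOKENS:
        return NOT_MENTION
    return value


def validate_record(record, schema, synonyms=None):
    """Return (cleaned_record, issues) for one model output."""
    synonyms = synonyms or {}
    cleaned, issues = {}, []
    for name, spec in schema.items():
        if name not in record:                      # completeness check
            issues.append((name, "missing_field"))
            cleaned[name] = NOT_MENTION
            continue
        value = normalize_missing(record[name])
        if value == NOT_MENTION:
            cleaned[name] = value
            continue
        if spec["type"] == "Categorical":           # vocabulary harmonization
            key = str(value).strip().lower()
            key = synonyms.get(key, key)
            if key not in spec["allowed"]:
                issues.append((name, "out_of_vocabulary"))
            value = key
        elif spec["type"] == "Numeric":             # data-type conformity
            try:
                value = float(value)
            except (TypeError, ValueError):
                issues.append((name, "not_numeric"))
        cleaned[name] = value
    for name in sorted(set(record) - set(schema)):
        issues.append((name, "unexpected_field"))
    return cleaned, issues


# Made-up schema and model output, for illustration only.
schema = {
    "pouch_of_douglas": {"type": "Categorical", "allowed": ["normal", "obliterated"]},
    "nodule_length_mm": {"type": "Numeric"},
    "comments": {"type": "Text"},
}
record = {
    "pouch_of_douglas": "Obliteration",
    "nodule_length_mm": "12 mm",
    "comments": "NA",
    "foo_extra": 1,
}
cleaned, issues = validate_record(record, schema, {"obliteration": "obliterated"})
```

Rules of this kind can catch only mechanical problems; a value that is in the vocabulary but clinically wrong passes unflagged.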
The experiment ran on a personal computer equipped with an
NVIDIA RTX 3090 GPU (24GB VRAM).
Figure 1: Overview of the on-premise workflow. De-identified
ultrasound PDFs and the organization’s spreadsheet tem-
plate are processed locally by a layout-aware LLM backend
to produce a schema-aligned structured JSON draft. A field
stratification strategy and rule-based validation prioritize se-
mantically sensitive fields for review. An interactive human-
in-the-loop interface supports rapid trace-back from each
field to relevant PDF text anchors, enabling verification and
correction before exporting a verified structured dataset for
clinical research use.
3 Preliminary Evaluation
Scoring and accuracy definition. We report field-level accuracy
scored against an expert-verified reference (Verified Truth). For each
report, we score each of the 185 schema fields as correct (1) if the
normalized prediction matches the normalized reference value, and
incorrect (0) otherwise; report-level accuracy is the mean across
fields, and we report the mean and standard deviation (SD) across 49
reports. For numeric/date fields, we convert outputs into a canonical
representation and compare for equality in that canonical form.
For text/categorical fields, we apply lightweight normalization
(case-folding plus whitespace and punctuation normalization) and
then require an exact match. Missing information is represented by
an explicit token NOT_MENTION; a field is counted as correct when
both prediction and reference indicate NOT_MENTION, and incorrect
when one indicates missingness while the other provides a value.
For protocol fields where 0 and NA both denote not detected / not
recorded, we treat 0 and NA as the same missingness state during
scoring. We did not perform significance testing; comparisons be-
low are descriptive and intended to characterize performance level
and variability in this setting.
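The scoring rule above can be made concrete with a short sketch. It assumes details the paper leaves open: punctuation is normalized by replacing it with spaces, numeric fields are canonicalized with `float`, and the across-report SD is the sample standard deviation. Date canonicalization is omitted for brevity, and the field names are made up.

```python
import string
from statistics import mean, stdev

NOT_MENTION = "NOT_MENTION"
MISSING_EQUIVALENTS = {"0", "na"}  # one missingness state for protocol fields
_PUNCT_TO_SPACE = str.maketrans({c: " " for c in string.punctuation})


def normalize_text(value):
    """Case-fold, replace punctuation with spaces, collapse whitespace."""
    return " ".join(str(value).casefold().translate(_PUNCT_TO_SPACE).split())


def canonical(value, kind="text", protocol_field=False):
    """Map a raw value to the form in which it is compared."""
    raw = str(value).strip()
    if protocol_field and raw.casefold() in MISSING_EQUIVALENTS:
        return "na"
    if kind == "numeric":
        try:
            return float(raw)
        except ValueError:
            pass  # e.g. NOT_MENTION: fall through to text comparison
    return normalize_text(raw)


def report_accuracy(pred, ref, kinds=None, protocol_fields=frozenset()):
    """Fraction of reference fields whose canonical prediction matches."""
    kinds = kinds or {}
    scores = []
    for field, ref_value in ref.items():
        kind = kinds.get(field, "text")
        is_protocol = field in protocol_fields
        pred_value = pred.get(field, NOT_MENTION)
        match = canonical(pred_value, kind, is_protocol) == canonical(ref_value, kind, is_protocol)
        scores.append(int(match))
    return sum(scores) / len(scores)


def summarize(accuracies):
    """Mean and sample SD of report-level accuracies (SD variant is an assumption)."""
    return mean(accuracies), stdev(accuracies)


# Made-up fields and values, for illustration only.
ref = {"pouch": "Obliterated", "length": "12.0", "adenomyosis": "NA", "comments": NOT_MENTION}
pred = {"pouch": " obliterated. ", "length": "12", "adenomyosis": "0", "comments": "bowel nodule"}
acc = report_accuracy(pred, ref, kinds={"length": "numeric"}, protocol_fields={"adenomyosis"})
print(acc)  # 0.75: three of four fields match; the spurious comment is scored incorrect
```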
Who Fails Where? LLM and Human Error Patterns in Endometriosis Ultrasound Report Extraction
Table 1: Aggregate LLM Backbone Comparison. Overall mean
accuracy and per-report standard deviation (Std) on the Sugo
dataset, benchmarked against the verified ground truth.

Model Backbone    Mean Accuracy (%)    Std (%)
gpt-oss:20b       86.02                6.87
llama3-8b         80.53                4.58
mistral-7b        78.89                4.68
Clinical RA       98.40                2.13
Our quantitative results are summarized in Table 1. Using the
same expert-sonographer annotated and double-checked refer-
ence labels (Verified Truth), we scored and compared three locally-
deployed LLMs and a Clinical Research Assistant (Clinical RA)
performing manual abstraction. The Clinical RA achieved a mean
accuracy of 98.40% with low variability (SD 2.13%). Among the
three LLMs, gpt-oss:20b achieved the highest mean accuracy in
our dataset (86.02%), while llama3-8b and mistral-7b achieved
mean accuracies of 80.53% and 78.89%, respectively. Notably, while
gpt-oss:20b achieved the highest mean accuracy among the LLMs,
it also showed the largest per-report variance (SD 6.87%), indicating
less consistent performance across reports.
Figure 2: Report-level accuracy distributions for three locally-
deployed LLMs. gpt-oss:20b attains a higher median accu-
racy but exhibits larger variability, including occasional low-
accuracy outliers.
Figure 2 further illustrates these distributional differences. The
box-and-whisker plot shows that gpt-oss:20b attains a higher
median accuracy but also a wider interquartile range (IQR), with
a small number of outliers where accuracy drops markedly (e.g.,
below 65%). Overall, this suggests that the larger model performs
better on average in our dataset but is more sensitive to a subset
of challenging reports, whereas smaller models exhibit a narrower
(but lower) performance range.
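For readers who want to reproduce this kind of distributional summary, the sketch below computes the median, IQR, and box-plot outliers from a list of report-level accuracies. The accuracy values are invented for illustration and are not the study's data; the 1.5 × IQR fence is the usual box-plot convention and is assumed here, not taken from the paper.

```python
from statistics import quantiles


def box_summary(accuracies, whisker=1.5):
    """Median, IQR, and points outside the whisker fences."""
    q1, q2, q3 = quantiles(accuracies, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - whisker * iqr
    high_fence = q3 + whisker * iqr
    outliers = [a for a in accuracies if a < low_fence or a > high_fence]
    return {"median": q2, "iqr": iqr, "outliers": outliers}


# Invented report-level accuracies (%), not the study's data.
example = [60.0, 84.0, 85.0, 86.0, 87.0, 88.0, 89.0, 90.0, 91.0]
summary = box_summary(example)
print(summary)
```

In this made-up example, one report at 60% falls below the lower fence, the same pattern of occasional low-accuracy outliers that the figure describes.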
3.1 Error Analysis by Field Type
To investigate the sources of divergence, we stratified performance
by field data type (Date, Numeric, Categorical, Text), as shown in
Figure 3. Unless noted otherwise, the LLM results in this subsection
focus on gpt-oss:20b as the best-performing local backbone in
our study, to highlight its typical error distribution.
The analysis reveals complementary failure modes between the
LLM and the human extractor. The LLM performs best on more
structured, protocol-constrained fields, achieving its highest accuracy
on Date fields (97.3%) and Numeric fields (92.7%). Remaining
errors in these categories are primarily omissions and
schema-alignment failures (e.g., incomplete decomposition of multi-
dimensional measurements). In contrast, errors are more concen-
trated in semantically sensitive Text and Categorical fields, where
the model often fails through omissions or inconsistent terminol-
ogy/ontology mapping [14, 16, 17].
Figure 3: Performance breakdown by field type. In our setting,
the LLM performs best on structured fields (Date, Numeric)
and worse on semantically nuanced fields (Text, Categorical).
In contrast, the human extractor’s errors were rarely clinical
misinterpretations, but instead were predominantly data-entry pro-
tocol failures. A common error involved correctly reading a 3D
nodule measurement from the report but failing to split it across
three separate required database fields.
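Decomposition errors of this kind are rule-verifiable. As a minimal sketch, assuming measurements are written in the form "12 x 8 x 5 mm" and that the three target fields are known in advance, a check could parse the source sentence and compare it with the stored values. The regular expression, the function names, and the field names are hypothetical.

```python
import re

# Three numbers separated by "x" or "×", with an optional mm/cm unit (assumed format).
_DIMENSIONS = re.compile(
    r"(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*[x×]\s*(\d+(?:\.\d+)?)\s*(mm|cm)?",
    re.IGNORECASE,
)


def split_measurement(text):
    """Return the three dimensions in millimetres, or None if none is found."""
    match = _DIMENSIONS.search(text)
    if match is None:
        return None
    scale = 10.0 if (match.group(4) or "mm").lower() == "cm" else 1.0
    return tuple(float(match.group(i)) * scale for i in (1, 2, 3))


def check_decomposition(source_text, record, fields):
    """List the fields whose stored value does not match the parsed dimension."""
    dims = split_measurement(source_text)
    if dims is None:
        return []
    flagged = []
    for field, expected in zip(fields, dims):
        try:
            ok = abs(float(record.get(field)) - expected) < 1e-6
        except (TypeError, ValueError):
            ok = False
        if not ok:
            flagged.append(field)
    return flagged


# Made-up sentence and field names, for illustration only.
sentence = "Nodule measuring 12 x 8 x 5 mm in the rectosigmoid."
fields = ["nodule_length_mm", "nodule_width_mm", "nodule_depth_mm"]
unsplit = {"nodule_length_mm": "12 x 8 x 5", "nodule_width_mm": "", "nodule_depth_mm": ""}
split = {"nodule_length_mm": "12", "nodule_width_mm": "8", "nodule_depth_mm": "5"}
print(check_decomposition(sentence, unsplit, fields))  # all three fields flagged
print(check_decomposition(sentence, split, fields))    # nothing flagged
```

The same check applies whether the draft came from a human abstractor or from the model.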
3.2 Error Analysis on Clinically Important
Fields
To examine whether prompt engineering could improve perfor-
mance on key items, we conducted a follow-up experiment using
a critical-field prompt for gpt-oss:20b. This prompt explicitly
identified the seven most clinically critical fields, with the goal of
improving extraction consistency for these items. The critical-field
prompt achieved a marginally higher mean accuracy (88.8%, SD =
6.23) than the generic prompt (87.0%, SD = 6.44). This difference of
1.8 percentage points is small relative to the report-level
variability and was not stable across reports.
We therefore treat this intervention as limited in utility
for this schema-constrained task: emphasizing importance alone
does not reliably change model behavior, and more practical safe-
guards are likely to come from auditable workflow mechanisms
(e.g., rule-based validation, risk-prioritized review, and structured
error logging) [1, 5].
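To put the size of this difference in context, one can express it in units of the report-level spread. The sketch below is purely descriptive arithmetic on the figures reported above; it assumes the SDs are in percentage points and uses a simple pooled SD, which ignores the fact that both prompts were run on the same reports. It is not a significance test, and the paper performs none.

```python
from math import sqrt


def standardized_difference(mean_a, sd_a, mean_b, sd_b):
    """Difference in means divided by the pooled SD (equal group sizes assumed)."""
    pooled_sd = sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return (mean_a - mean_b) / pooled_sd


# Critical-field prompt vs. generic prompt, values as reported in the text.
d = standardized_difference(88.8, 6.23, 87.0, 6.44)
print(round(d, 2))  # about 0.28 pooled SDs
```

Under these assumptions the gain is under a third of one standard deviation of report-level accuracy, consistent with describing it as marginal.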
4 Discussion
4.1 Complementary Failure Modes Enable Task
Allocation
Our results reveal that LLMs and human abstractors fail on different
field types, establishing a basis for differentiated task allocation.
The LLM achieved near-human accuracy on Date (97.3%) and Nu-
meric (92.7%) fields, where errors were predominantly mechani-
cal (incomplete measurement decomposition, format mismatches)
and detectable through rule-based validation. In contrast, the LLM
struggled with Categorical and Text fields (77.3% and 80.0%), which
require interpreting negation, mapping synonymous terms to con-
trolled vocabularies, and inferring clinical intent. These semantic
errors are syntactically well-formed, confirming that schema com-
pliance alone cannot ensure extraction quality.
This complementarity suggests that uniform human oversight is
neither necessary nor efficient. Fields with high LLM reliability and
rule-detectable failures can be auto-validated with exception-based
review, while semantically sensitive fields should be flagged for
mandatory human verification. Implementing this workflow re-
quires interfaces that link each extracted value to its source text, en-
abling reviewers to verify semantic correctness without re-reading
entire reports.
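One possible reading of this allocation is sketched below: interpretive field types always go to the reviewer, structured types go only when a rule fires, and every queued item carries its source sentence for trace-back. The type-to-policy mapping, the `ReviewItem` structure, and the field names are assumptions for illustration; a deployed system would likely stratify more finely than by data type.

```python
from dataclasses import dataclass

MANDATORY_REVIEW_TYPES = {"Categorical", "Text"}  # assumed semantically sensitive types


@dataclass
class ReviewItem:
    field: str
    value: object
    reason: str
    evidence: str  # source sentence shown next to the value in the review interface


def route_fields(record, schema, rule_issues, evidence):
    """Build the human review queue for one extracted report."""
    queue = []
    for name, value in record.items():
        if schema[name]["type"] in MANDATORY_REVIEW_TYPES:
            reason = "semantic_field"
        elif name in rule_issues:
            reason = "rule_violation"   # exception-based review
        else:
            continue                    # auto-validated, no review needed
        queue.append(ReviewItem(name, value, reason, evidence.get(name, "")))
    return queue


# Made-up schema, values, and evidence, for illustration only.
schema = {
    "exam_date": {"type": "Date"},
    "nodule_length_mm": {"type": "Numeric"},
    "pouch_of_douglas": {"type": "Categorical"},
}
record = {"exam_date": "2024-01-15", "nodule_length_mm": "12 x 8 x 5",
          "pouch_of_douglas": "obliterated"}
evidence = {"pouch_of_douglas": "No evidence of pouch of Douglas obliteration."}
queue = route_fields(record, schema, rule_issues={"nodule_length_mm"}, evidence=evidence)
for item in queue:
    print(item.field, item.reason, repr(item.evidence))
```

In this made-up example the categorical value contradicts the negated source sentence, which is exactly the kind of error that only the side-by-side evidence makes visible.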
4.2 Extraction Variance as a Deployment
Criterion
Mean accuracy alone obscures operational risk. The 20B model
achieved the highest average accuracy (86.02%) but also the widest
variance (SD = 6.87%), with outliers below 65%. Inspection of these
cases revealed two patterns: reports with atypical formatting trig-
gered cascading failures, and reports dense with negated or condi-
tional findings led to accumulated errors across categorical fields.
Smaller models showed lower variance but at a lower accuracy
level, suggesting more consistent but conservative outputs.
High variance translates to unpredictable review burden. A re-
viewer calibrated for occasional errors may overlook reports where
a third of extractions fail. This observation supports mechanisms
such as confidence-based triage, routing structurally atypical or
low-confidence reports to full review. More broadly, model selec-
tion should be framed as a variance-accuracy trade-off: in some
operational contexts, a smaller model with predictable, recover-
able errors may prove more practical than a larger model whose
sporadic failures are harder to detect.
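A report-level triage rule of this kind might be sketched as follows. The paper does not define a confidence signal, so this example substitutes two proxies as assumptions: the share of fields with rule-validation issues and the share of fields left as NOT_MENTION. The thresholds and the function name are hypothetical and would need calibration on local data.

```python
NOT_MENTION = "NOT_MENTION"


def triage_report(cleaned, issues, max_issue_rate=0.05, max_missing_rate=0.95):
    """Route a report to 'full_review' or 'exception_review' (thresholds assumed)."""
    n_fields = len(cleaned)
    if n_fields == 0:
        return "full_review"  # nothing extracted: treat as structurally atypical
    issue_rate = len({name for name, _ in issues}) / n_fields
    missing_rate = sum(v == NOT_MENTION for v in cleaned.values()) / n_fields
    if issue_rate > max_issue_rate or missing_rate > max_missing_rate:
        return "full_review"
    return "exception_review"


# Made-up outputs, for illustration only.
typical = {f"field_{i}": "value" for i in range(20)}
print(triage_report(typical, issues=[]))
print(triage_report(typical, issues=[("field_1", "not_numeric"),
                                     ("field_2", "out_of_vocabulary")]))
```

Proxies like these can surface cascading mechanical failures, but they would not detect a report whose errors are purely semantic.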
5 Limitations
Our evaluation is bounded by a modest sample (49 reports) from
a single institution, which may not capture variation in reporting
styles across sites. Data-sovereignty constraints restricted evalu-
ation to locally deployable models (7B–20B parameters); cloud-
hosted or domain-specialized medical backbones (e.g., MedGemma)
may exhibit different error patterns and warrant future comparison.
Finally, our critical-field prompting experiment yielded only marginal
and unstable gains, suggesting that for schema-constrained tasks where
errors are semantic rather than attentional, prompt emphasis alone
is insufficient. This motivates workflow-level safeguards, rather than
further prompt refinement, as the more practical mitigation.
6 Conclusion
We present a systematic evaluation of locally deployed LLMs for
structured extraction from endometriosis transvaginal ultrasound
reports, addressing settings where data-sovereignty requirements
preclude cloud-based processing. Three findings inform the design
of human-AI abstraction workflows. First, the 20B model achieved
the highest field-level accuracy (86.02%) while remaining feasible
for on-premise deployment, demonstrating that capable local mod-
els can support clinical extraction tasks. Second, LLMs and human
abstractors exhibit complementary error patterns: LLMs excel on
structured fields while humans outperform on interpretive fields,
suggesting a division of labour rather than full automation or uni-
form oversight. Third, targeted prompt engineering yielded no
meaningful improvement, indicating that prompt refinement alone
cannot resolve errors on interpretive fields and that workflow-level
safeguards are necessary. These findings support a human-in-the-
loop workflow in which local LLMs generate structured drafts
at scale, automated validation flags mechanical errors, and tar-
geted human review addresses interpretive fields requiring clinical
judgement. Future work should validate these patterns on multi-
site datasets and develop lightweight, auditable mechanisms for
confidence-based triage.
References
[1] Monica Agrawal, Stefan Hegselmann, Hunter Lang, Yoon Kim, and David Sontag.
2022. Large language models are few-shot clinical information extractors. arXiv
preprint (2022). https://arxiv.org/abs/2205.12689 arXiv:2205.12689.
[2] Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko, Lauren Wilcox,
Paisan Ruamviboonsuk, and Laura M. Vardoulakis. 2020. A Human-Centered
Evaluation of a Deep Learning System Deployed in Clinics for the Detection
of Diabetic Retinopathy. In Proceedings of the 2020 CHI Conference on Human
Factors in Computing Systems (Honolulu, HI, USA) (CHI ’20). Association for
Computing Machinery, New York, NY, USA, 1–12. doi:10.1145/3313831.3376718
[3] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan,
Prafulla Dhariwal, et al. 2020. Language Models are Few-Shot Learners. In
Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato,
R. Hadsell, M. F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates,
Inc., 1877–1901.
https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
[4] Zana Buçinca, Maja Barbara Malaya, and Krzysztof Z. Gajos. 2021. To Trust
or to Think: Cognitive Forcing Functions Can Reduce Overreliance on AI in
AI-assisted Decision-making. Proc. ACM Hum.-Comput. Interact. 5, CSCW1,
Article 188 (April 2021), 21 pages. doi:10.1145/3449287
[5] Felix Busch, Lena Hoffmann, Daniel Pinto Dos Santos, Marcus R Makowski,
Luca Saba, Philipp Prucker, et al. 2025. Large language models for structured
reporting in radiology: past, present, and future. European Radiology 35, 5 (2025),
2589–2602.
[6] Adrian Bussone, Simone Stumpf, and Dympna O’Sullivan. 2015. The Role of
Explanations on Trust and Reliance in Clinical Decision Support Systems. In
2015 International Conference on Healthcare Informatics. 160–169.
doi:10.1109/ICHI.2015.26
[7] Carrie J. Cai, Samantha Winter, David Steiner, Lauren Wilcox, and Michael Terry.
2019. "Hello AI": Uncovering the Onboarding Needs of Medical Practitioners for
Human-AI Collaborative Decision-Making. Proc. ACM Hum.-Comput. Interact. 3,
CSCW, Article 104 (Nov. 2019), 24 pages. doi:10.1145/3359206
[8] Sergio M Castro, Eugene Tseytlin, Olga Medvedeva, Kevin Mitchell, Shyam
Visweswaran, Tanja Bekhuis, and Rebecca S Jacobson. 2017. Automated annotation
and classification of BI-RADS assessment from radiology reports. Journal
of Biomedical Informatics 69 (2017), 177–187.
[9] Mary F Davis, Subramaniam Sriram, William S Bush, Joshua C Denny, and
Jonathan L Haines. 2013. Automated extraction of clinical traits of multiple
sclerosis in electronic medical records. Journal of the American Medical Informatics
Association 20, e2 (2013), e334–e340.
[10] Geraldine Fitzpatrick and Gunnar Ellingsen. 2013. A Review of 25 Years of CSCW
Research in Healthcare: Contributions, Challenges and Future Agendas. Comput.
Supported Coop. Work 22, 4–6 (Aug. 2013), 609–665.
doi:10.1007/s10606-012-9168-0
[11] Sami-Ramzi Leyh-Bannurah, Zhe Tian, Pierre I Karakiewicz, Ulrich Wolffgang,
Guido Sauter, Margit Fisch, et al. 2018. Deep learning for natural language
processing in urology: state-of-the-art automated extraction of detailed pathologic
prostate cancer data from narratively written electronic health records. JCO
Clinical Cancer Informatics 2 (2018), 1–9.
[12] Yunxiang Li, Zihan Li, Kai Zhang, Ruilong Dan, Steve Jiang, and You Zhang.
2023. Chatdoctor: A medical chat model fine-tuned on a large language model
meta-ai (llama) using medical domain knowledge. Cureus 15, 6 (2023).
[13] Guergana K Savova, Eugene Tseytlin, Sean Finan, Melissa Castine, Timothy
Miller, Olga Medvedeva, et al. 2017. DeepPhe: a natural language processing
system for extracting cancer phenotypes from clinical records. Cancer Research
77, 21 (2017), e115–e118.
[14] Seyedmostafa Sheikhalishahi, Riccardo Miotto, Joel T Dudley, Alberto Lavelli,
Fabio Rinaldi, and Venet Osmani. 2019. Natural Language Processing of Clinical
Notes on Chronic Diseases: Systematic Review. JMIR Medical Informatics 7, 2
(27 Apr 2019), e12239. doi:10.2196/12239 PubMed: 31066697. Also available at:
http://medinform.jmir.org/2019/2/e12239/.
[15] Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won
Chung, et al. 2023. Large language models encode clinical knowledge. Nature
620, 7972 (2023), 172–180.
[16] Yanshan Wang, Liwei Wang, Majid Rastegar-Mojarad, Sungrim Moon, Feichen
Shen, Naveed Afzal, Sijia Liu, Yuqun Zeng, Saeed Mehrabi, Sunghwan Sohn, and
Hongfang Liu. 2018. Clinical information extraction applications: A literature
review. Journal of Biomedical Informatics 77 (2018), 34–49.
doi:10.1016/j.jbi.2017.11.011
[17] Yuqing Wang, Yun Zhao, and Linda Petzold. 2023. Are large language models
ready for healthcare? a comparative study on clinical language understanding.
In Machine Learning for Healthcare Conference. PMLR, 804–823.
[18] David L. Weiss and Curtis P. Langlotz. 2008. Structured Reporting: Patient Care
Enhancement or Productivity Nightmare? Radiology 249, 3 (2008), 739–747.
doi:10.1148/radiol.2493080988 PMID: 19011178.