Who Fails Where? LLM and Human Error Patterns in Endometriosis Ultrasound Report Extraction

Haiyi Li; Yutong Li; Yiheng Chi; Alison Deslandes; Mathew Leonardi; Shay M. Freger; Yuan Zhang; Jodie Avery; M. Louise Hull; Hsiang-Ting Chen

doi:10.48550/arxiv.2601.09053

2026 · doi:10.48550/arxiv.2601.09053 · W7124268603

preprint OA: green CC0

⚙ AI-generated summary by claude@2026-06, 2026-06-13 ⓘ

A 20B-parameter LLM achieved 86.02% mean accuracy extracting structured data from endometriosis ultrasound reports, with errors complementary to those of human experts: the LLM was more consistent on syntax, while humans were better at semantic interpretation.

One-sentence paraphrase of the abstract; not a substitute for reading it. No clinical advice.

⚙ AI-generated deep summary by claude@2026-06, 2026-06-13 · read from full text ⓘ

This paper evaluates a locally deployed large language model (LLM) for extracting structured data from unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports. It compares three LLMs (7B/8B and a 20B-parameter model) against expert human extraction across 49 reports. The 20B model reached a mean accuracy of 86.02%, outperforming the smaller models. The authors characterize a complementary error pattern: the LLM is stronger on syntactic consistency (e.g., date/numeric formatting), while humans perform better on semantic and contextual interpretation. A key reported limitation is that the LLM's semantic errors are fundamental and could not be mitigated by simple prompt engineering. The paper is centrally about endometriosis: it uses LLMs to structure endometriosis ultrasound report data and analyzes where LLM and human extraction errors occur.

Read from the paper's body, not the abstract. Not a substitute for reading the paper. No clinical advice.

Abstract

In this study, we evaluate a locally-deployed large-language model (LLM) to convert unstructured endometriosis transvaginal ultrasound (eTVUS) scan reports into structured data for imaging informatics workflows. Across 49 eTVUS reports, we compared three LLMs (7B/8B and a 20B-parameter model) against expert human extraction. The 20B model achieved a mean accuracy of 86.02%, substantially outperforming smaller models and confirming the importance of scale in handling complex clinical text. Crucially, we identified a highly complementary error profile: the LLM excelled at syntactic consistency (e.g., date/numeric formatting) where humans faltered, while human experts provided superior semantic and contextual interpretation. We also found that the LLM's semantic errors were fundamental limitations that could not be mitigated by simple prompt engineering. These findings strongly support a human-in-the-loop (HITL) workflow in which the on-premise LLM serves as a collaborative tool, not a full replacement. It automates routine structuring and flags potential human errors, enabling imaging specialists to focus on high-level semantic validation. We discuss implications for structured reporting and interactive AI systems in clinical practice.
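The abstract reports a mean accuracy and a workflow in which LLM output is used to flag potential human errors, but it does not say how either was computed. The following is only a minimal sketch of one plausible reading, assuming accuracy is the fraction of fields in a report that exactly match a reference extraction, averaged over reports. The function names (`field_accuracy`, `mean_accuracy`, `flag_disagreements`), the field names, and the toy values are all hypothetical and are not taken from the paper.

```python
from statistics import mean


def field_accuracy(predicted: dict, reference: dict) -> float:
    """Fraction of reference fields whose predicted value matches exactly.

    Assumption: exact-match scoring per field; the paper's metric may differ.
    """
    if not reference:
        raise ValueError("reference must contain at least one field")
    hits = sum(1 for key, value in reference.items() if predicted.get(key) == value)
    return hits / len(reference)


def mean_accuracy(predictions: list, references: list) -> float:
    """Average the per-report field accuracy over paired reports."""
    return mean(field_accuracy(p, r) for p, r in zip(predictions, references))


def flag_disagreements(llm: dict, human: dict) -> list:
    """Field names where the two extractions differ, sorted for review."""
    return sorted(k for k in set(llm) | set(human) if llm.get(k) != human.get(k))


# Invented toy data: one report, three hypothetical fields.
reference = {"scan_date": "2024-03-05", "left_ovary_endometrioma": "yes", "uterus_length_mm": 82}
# The LLM formats the date consistently but misreads a finding (semantic slip).
llm_out = {"scan_date": "2024-03-05", "left_ovary_endometrioma": "no", "uterus_length_mm": 82}
# The human reads the finding correctly but formats the date inconsistently.
human_out = {"scan_date": "05/03/24", "left_ovary_endometrioma": "yes", "uterus_length_mm": 82}

to_review = flag_disagreements(llm_out, human_out)
```

In this toy case both extractions score the same under exact match yet fail on different fields, so `to_review` lists exactly the two fields a reviewer would need to check. That is the kind of complementary error profile the abstract describes, shown here on made-up data only.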

arXiv record (extracted from oa-html)

Subject: Computer Science > Human-Computer Interaction

Submitted on 14 Jan 2026 (v1); last revised 26 Jan 2026 (this version, v2).

Submission history (from Haiyi Li):
v1: Wed, 14 Jan 2026 00:46:51 UTC (170 KB)
v2: Mon, 26 Jan 2026 04:51:51 UTC (1,437 KB)

Condition tags

endometriosis

Source provenance

openalex: last seen: 2026-06-04T00:00:01.174412+00:00

License: CC0 · commercial use OK