Developing a scalable pipeline for data extraction from clinical letters through resource-efficient prompt engineering | Research Square
Ariel Yuhan Ong, Quang Nguyen, Ishani Barai, Justin Engelmann, and 10 more
This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-8921439/v1 This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1 posted.
Abstract Free-text clinical records represent an untapped wealth of data for secondary use. Their potential is limited by resource demands necessary for accurate information extraction at scale. We introduce a scalable, resource-efficient, and high-performance pipeline which leverages large language models (LLMs) to address these challenges. This was developed and tested using real-world dual specialist-annotated ophthalmic clinical letters. 
Our pipeline achieved strong performance with a proprietary model in the development phase, yielding a maximum micro-averaged F1 score of 0.954 (95% CI 0.941–0.967) for diagnosis across nine conditions through iterative prompt refinement alone, also demonstrating strong generalisability (micro-F1 ranging from 0.945–0.980) in temporal validation. This approach extended to two other proprietary models in the same family and was tested in 17 local models from seven open-weight LLM families, demonstrating robustness against model choice and deployment constraints (for models > 10B parameters). Beyond performance, we develop a multi-dimensional assessment to evaluate LLMs for deployment in data extraction tasks, including introducing an error taxonomy to classify failure modes and implementing Pareto frontier analyses to systematically map the operational trade-offs (costs, time) across various LLM configurations. A robust approach to operationalising LLMs in real-world workflows at scale may help lay the foundation for next-generation data pipelines that can accelerate scientific discovery and power continuous learning health systems. Biological sciences/Computational biology and bioinformatics Physical sciences/Engineering Physical sciences/Mathematics and computing Introduction The adoption of electronic health records (EHRs) has revolutionised healthcare delivery by enabling rapid access to patient information and improving documentation quality. 1,2 However, as EHRs were primarily designed with clinical, billing, and administrative needs in mind, 3,4 secondary use for research presents a misalignment due to data quality and availability. 5 In addition, an estimated 80% of EHR data exists in an unstructured format (e.g. 
as clinical letters, discharge summaries, progress notes, imaging reports), 6 and manual data extraction remains a significant bottleneck due to the resource requirements and risk of human errors. A scalable, systematic, and automated approach to data extraction is necessary to maximise the utility of this valuable resource. While natural language processing (NLP) techniques have shown promise, rule-based systems are brittle and do not generalise well, and traditional supervised machine learning (including early transformers such as BERT) requires large datasets with task-specific annotations for fine-tuning. 7–9 Large language models (LLMs) are generative artificial intelligence (AI) models trained on vast corpora of text that can understand and generate natural human-like language. 10 Compared to traditional NLP methods, LLMs excel in tasks requiring contextual understanding and reasoning, and can adapt to nuances and variability in complex clinical narratives without requiring elaborate feature engineering or a large manually annotated training dataset. While fine-tuning LLMs can improve performance, alternative techniques such as prompt engineering may also minimise the need for labelled training data, computational resources, and custom training for each new use case. 10,11 Previous work has demonstrated the potential of this approach in extracting data from pathology reports, 12 radiology reports, 13 and hospital admission records in emergency and acute psychiatry settings. 14,15 However, beyond proofs of concept, deploying a high-performing model at scale is not a trivial task. Real-world deployment demands a more balanced consideration of accuracy that goes beyond sensitivity and specificity, given the impact of any upstream errors on downstream research outputs. In addition, a singular focus on performance is insufficient: a comprehensive validation framework that takes resource use and trade-offs into account is needed. 
One of the richest sources of unstructured clinical data is the corpus of clinical records relating to outpatient care, but this is relatively underexplored. Ophthalmology is an ideal specialty within which to evaluate this question: as the busiest outpatient specialty in the United Kingdom (UK), with close to 10 million outpatient appointments in England as of 2024–25, 16 it represents a rich source of untapped data from a high-volume specialty. In particular, ophthalmic clinical letters, a primary clinical artefact that summarises the clinical encounter and functions both as a care record and communication with primary care services, are characterised by a high density of specialist terminology and non-standardised abbreviations, 17 and typically contain a complex mix of unstructured narratives, semi-structured data, uncertainties that may be highly nuanced, and features such as laterality which are essential to clinical meaning, all without necessarily adhering to a standardised template. In this study, we develop a scalable pipeline for the complex task of extracting information from real-world ophthalmic clinical letters. We evaluate the extent to which out-of-the-box LLMs can achieve a pre-specified and clinically reliable performance threshold, compare performance across frontier and locally deployable LLMs, and propose a multi-dimensional assessment to evaluate performance as well as operational efficiency, failure modes, and robustness for deployment on real-world data pipelines for downstream tasks. Using this approach, we select the optimal model to scale up our data extraction pipeline to a large dataset. Methods Task definition We aimed to demonstrate the development of an end-to-end pipeline for ophthalmic data extraction to support downstream research using real-world data, such as large-scale clinical validation of AI systems in the real world. 
For this proof-of-concept, we selected a deep learning algorithm for diagnosing and triaging patients with suspected macular disease as an exemplar. 18 We aimed to extract data on nine common macular conditions that the algorithm identifies from optical coherence tomography (OCT) scans - choroidal neovascularisation (CNV), macular oedema, central serous chorioretinopathy (CSCR), drusen, geographic atrophy (GA), epiretinal membrane (ERM), partial thickness macular hole (PTMH), full-thickness macular hole (FTMH), and vitreomacular traction (VMT) (as defined in Supplementary Table 1). This model has been tested against expert clinicians in a small highly-selected test set of 997 cases (diagnostic accuracy study), 18 but has not been compared against real-world clinician performance and real-world disease prevalence at scale (agreement study). As scaling up real-world testing is limited by the poor quality and availability of structured diagnosis data in EHR records, the goal of this study was to develop a pipeline that could reliably identify the presence of each disease class and the corresponding laterality at the time of the appointment from unstructured clinic letters. The study workflow is detailed in Fig. 1 . Data source This study used clinical letters from a retrospective cohort of adult patients aged ≥ 18 years who attended the retina service (medical retina and vitreoretinal clinics) at Moorfields Eye Hospital NHS Foundation Trust (MEH) from 1 February 2017 to 14 November 2024. MEH encompasses 30 networked centres serving a socioeconomically and ethnically diverse catchment of 6 million people across London in the UK, approximately 9% of the UK population. All clinical letters used in this study were written in English and were anonymised prior to analysis. This process employed an automated de-identification NLP pipeline (AnonCAT, Cogstack). 
19 Protected health information (PHI) such as names, date of birth, identifiers (hospital or NHS numbers), location, and contact details were redacted in accordance with guidance issued by the UK Information Commissioner’s Office. 20 This study was conducted as a service improvement project (2024/1568_v1) and adhered to the tenets of the Declaration of Helsinki. Development dataset A development dataset of 600 letters (from 600 patients, all written in 2022) was developed. The dataset was constructed using a two-stage sampling strategy to ensure adequate representation of both common and rarer conditions. Initially, a consecutive sample of 400 clinical letters was labelled to mirror the real-world data distribution and disease complexity. Due to class imbalance (prevalence ranging from 1% in GA to 18% in macular oedema in this sample), four low-prevalence classes (CSCR, GA, PTMH, FTMH) were enriched with additional cases. For this enrichment dataset, we employed a hybrid retrieval method which combined a traditional keyword-based retriever (BM25) 21 with a semantic retriever (MedCPT) 22 . Rankings were integrated using reciprocal rank fusion with a decaying weight across multiple query terms to balance lexical and semantic relevance. 23 To mitigate potential bias from oversampling straightforward cases, this enrichment was stratified and incorporated a range of probabilities for each condition. This approach ensured statistically robust sensitivity estimates while maintaining feasibility for manual labeling. A minimum threshold of 40 positive cases per class was set to capture a range of case complexity and writing styles. Labelling guidelines and clinician labelling The primary extraction task required the identification of two key pieces of information for each of the nine disease classes: 1. Status: defined as ‘Absent’, ‘Present’, or ‘Suspected’. 
In order to be defined as ‘Present’, the condition had to be documented as current at the time of the clinic visit, explicitly excluding historical mentions if the condition had fully resolved without further sequelae. ‘Suspected’ referred to cases where the clinician expressed diagnostic uncertainty. 2. Laterality: defined as the corresponding laterality (left, right, both) for each condition. To support the development of a gold standard benchmark, labelling guidelines were initially developed by an ophthalmologist with eight years of clinical experience. This dataset was then labelled independently by two ophthalmologists who have practiced clinically for eight and six years. All discrepancies were resolved through a consensus review process to determine the final label for each case, with recourse to a senior ophthalmologist for arbitration where necessary. Agreement was tabulated using Cohen’s kappa (κ). This process also served to support the iterative refinement of the labelling guidelines to ensure clarity and consistency. Prompt engineering, refinement, and error taxonomy development The prompt was constructed using a modular approach to prompt engineering, which started with a simple baseline prompt, to which additional components and refinements were iteratively added with the aim of improving performance. The final standardised prompt integrated the following modules in a sequential chain (Fig. 2 ): 1. Task definition specifying the core instructions (“Role” and “Task”) and the required JSON output schema (“JSON Schema and Instructions”); and 2. Specifying a requirement to summarise and provide the evidence for each extracted finding (Evidence rationalisation); and 3. Core principles from the initial labelling guideline, which discuss principles relevant across all disease; and 4. Description of conditions comprising a detailed description of each condition from the initial labelling guideline; and 5. 
Refined instructions derived from the iterative error analysis which were aimed at mitigating common failure modes. Prompt 5b rearranged the position of different modules (i.e. moving the letter from the bottom of the prompt to the top, immediately after the role and task description) to test the effect on model performance; 6. In-context learning (ICL), which included providing a few examples of specific tasks (formatted as per the required output JSON schema) in which errors remained despite the provision of refined instructions; or 7. Chain-of-thought (CoT) thinking, which involved asking the model to “think” through each step of the task in sequential order. The prompt structure is outlined in Supplementary Fig. 1. An example of a letter and outputs is presented in Supplementary Fig. 2. Gemini 1.5 Flash (Google) was selected as the baseline development model for the iterative prompt engineering and refinement phase. This was because of its unique balance of speed, cost, and performance which helps facilitate rapid and iterative testing. All testing was conducted in August 2025 (temperature 0, maximum tokens 8092). Access to all Gemini models was provisioned through a secure virtual machine hosted in a dedicated controlled research environment in Google Cloud Platform. Iterative refinement was performed with the aim of optimising the model outputs, using a pre-specified performance target of F1 score ≥ 0.95. The metric was chosen as it provides a balanced assessment of a model's precision and recall (sensitivity), since both are essential in ensuring the output is of sufficient quality to serve as a reliable “silver standard” for subsequent research and clinical validation tasks. F1 score was selected as the target over more commonly preferred metrics such as sensitivity and specificity in consideration of the need for balancing precision and recall for downstream tasks. 
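For reference, the F1 target can be written in terms of per-class counts. This is the standard textbook definition rather than a formula quoted from the study; TP, FP, and FN are assumed to denote the true positives, false positives, and false negatives for a given disease class.

```latex
\[
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
    = \frac{2\,TP}{2\,TP + FP + FN}
\]
```

Because true negatives do not appear in the expression, F1 is not inflated by the large number of letters in which a rare condition is correctly labelled as absent.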
In large-scale data extraction pipelines, false positives can introduce noise, whereas false negatives risk systematic omission of relevant clinical information. The F1 score therefore provides a more holistic assessment of model utility for real-world deployment, where performance must remain robust under varying prevalence and documentation styles. Errors - defined as false negatives (FN, missed entities) or false positives (FP, e.g. incorrect extractions or spurious predictions) - were identified from these initial outputs. Analysis of these errors informed successive refinements to prompt structure and content, as well as the development of an error taxonomy to support qualitative examination of model failure modes. Taxonomy development followed Braun and Clarke’s approach to inductive thematic analysis, 24 which involved data familiarisation, generation of initial error categories based on observed patterns, and iterative refinement into key higher-level themes and sub-themes through repeated analytic engagement. Testing for generalisability and efficiency Additional testing was performed to assess the real-world applicability of this prompting strategy. Note that prompt development and refinement were conducted exclusively using the aforementioned development dataset. Once finalised, the prompts were frozen and evaluated without further modification on all external datasets and model families. Firstly, cross-model robustness was tested by evaluating the prompting strategy on other models from the Gemini family (‘Gemini 1.5 Pro’, ‘Gemini 2.5 Flash’, and ‘Gemini 2.5 Flash Updated’, an updated release issued by Google in September 2025 25 ), followed by testing on foundational LLM backbones from 7 different open-weight (“local”) model families (Gemma/Medgemma, Llama, Mistral, DeepSeek, Phi, Cogito, GPT-oss) and across a range of model sizes within the same family. 
To facilitate the evaluation of local models in the context of constrained computational resources, inference for all local models was executed via the ollama framework 26 using 4-bit post-training quantisation, primarily using the Q4_K_M and Q4_0 GGUF variants. Prior work has demonstrated that 4-bit quantisation offers an optimal trade-off between memory footprint and model performance. 27 All experiments were performed on a secure virtual machine, which was equipped with 1–4 NVIDIA Tesla T4 GPUs for experiments conducted with local models. The full list of models, parameters, and computational resources is presented in Supplementary Table 2. Secondly, performance of the primary LLM family was evaluated on an independent dataset of 300 letters randomly sampled from the wider dataset of 219,930 letters spanning 2017–2024 (excluding 2022), which comprised the primary subspecialties tested in the “development dataset” (medical retina, vitreoretinal surgery) and included additional unseen ophthalmic subspecialties in which macular OCT scans may be used. This was to test robustness against potential data drift resulting from clinician turnover and/or evolving writing styles, changing disease prevalence or complexity, or other temporal factors. A Pareto frontier analysis was conducted to identify the subset of LLMs that offer the optimal trade-off between performance (the micro-averaged F1 score, used to determine the overall value of each model) and real-world considerations for scalability (mean cost and time per letter). The Pareto frontier represents the set of models that are not dominated, i.e. those for which no other model is at least as good on every metric and strictly better on at least one. For the proprietary models (Gemini 1.5 Flash, 1.5 Pro, 2.5 Flash), cost was calculated using Gemini’s per-million-token pricing for input and output tokens. For local models, cost was calculated as a factor of GPU infrastructural costs and time. We established the Pareto frontiers across all models and for local models alone. 
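The Pareto frontier described above can be illustrated with a minimal sketch. It is not the study's analysis code: the `ModelResult` type, the `pareto_frontier` function, and the numbers in `example` are hypothetical and chosen only to show the dominance check for one performance axis (micro-F1, higher is better) and one resource axis (mean cost or time per letter, lower is better).

```python
from typing import NamedTuple


class ModelResult(NamedTuple):
    name: str
    f1: float    # micro-averaged F1, higher is better
    cost: float  # mean cost (or time) per letter, lower is better


def dominates(b: ModelResult, a: ModelResult) -> bool:
    """True if b is at least as good as a on both axes and strictly better on one."""
    at_least_as_good = b.f1 >= a.f1 and b.cost <= a.cost
    strictly_better = b.f1 > a.f1 or b.cost < a.cost
    return at_least_as_good and strictly_better


def pareto_frontier(results: list[ModelResult]) -> list[ModelResult]:
    """Return the non-dominated models, sorted from cheapest to most expensive."""
    frontier = [
        a for a in results
        if not any(dominates(b, a) for b in results)
    ]
    return sorted(frontier, key=lambda r: r.cost)


# Hypothetical values for illustration only (not results from the study).
example = [
    ModelResult("model_a", f1=0.95, cost=1.0),
    ModelResult("model_b", f1=0.90, cost=0.5),
    ModelResult("model_c", f1=0.89, cost=0.8),  # worse and dearer than model_b
    ModelResult("model_d", f1=0.97, cost=3.0),
]

if __name__ == "__main__":
    for model in pareto_frontier(example):
        print(model.name, model.f1, model.cost)
```

In this toy example `model_c` is excluded because `model_b` is both cheaper and more accurate; the remaining three models each represent a different trade-off and none can be improved on one axis without losing on the other.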
Error analysis An LLM-as-a-judge configuration was developed to facilitate error classification according to the error taxonomy. A representative sample of errors drawn from all models across 25 letters was coded by an experienced ophthalmologist. In parallel, an LLM (Gemini 2.5 Flash, selected for its high performance and optimal speed-cost configuration) was configured as an automated classifier (“LLM judge”) for the same error set, and was provided with identical contextual inputs, including the original clinical letter, the human reference standard and reasoning, and the erroneous LLM response and reasoning trace. Agreement between the human expert and LLM judge was measured using Cohen’s kappa (κ). The error classification prompt was refined to achieve a strong level of agreement (κ > 0.80) 28 against the human reference standard, whereupon the automated workflow was deployed to classify the full corpus of errors to characterise failure modes across all models and prompts. Statistical analysis Descriptive statistics were used to characterise the dataset. Counts and proportions were summarised for categorical variables, and means and standard deviation (SD) or medians and interquartile range (IQR) were reported for continuous variables after checking for normality using visual inspection of Q-Q plots. Each letter could contain zero, one, or multiple classes simultaneously. This was therefore treated as a multi-label classification problem, and performance metrics were calculated in a per-class, ‘one-vs-rest’ manner, treating each class independently while allowing for multiple labels within the same clinical letter. Inter-rater agreement between the two labellers was calculated using Cohen's κ on a per-class basis. Performance metrics were firstly computed on a per-class basis. This included sensitivity (recall), specificity, positive predictive value (PPV, or precision), negative predictive value (NPV), and F1 score. 
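The per-class, one-vs-rest metrics listed above can be sketched as follows. This is an illustrative pure-Python version, not the authors' code (which used statsmodels and scikit-learn); the function names and the toy labels are hypothetical, and each class is assumed to be reduced to a binary present/absent label per letter. The sketch also shows how the per-class counts combine into micro- and macro-averaged F1.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for one class across all letters (labels are 0/1)."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t and p:
            tp += 1
        elif p:
            fp += 1
        elif t:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def ratio(num, den):
    """Safe division: undefined metrics are returned as NaN."""
    return num / den if den else math.nan


def per_class_metrics(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


def micro_macro_f1(truth, pred):
    """truth and pred map class name -> list of 0/1 labels, one per letter."""
    total_tp = total_fp = total_fn = 0
    class_f1 = []
    for cls in truth:
        tp, fp, fn, _ = confusion_counts(truth[cls], pred[cls])
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
        class_f1.append(ratio(2 * tp, 2 * tp + fp + fn))
    micro = ratio(2 * total_tp, 2 * total_tp + total_fp + total_fn)  # pooled counts
    macro = sum(class_f1) / len(class_f1)                            # equal class weight
    return micro, macro


# Toy labels for four letters and two classes (illustration only).
truth = {"CNV": [1, 1, 0, 0], "ERM": [1, 1, 1, 0]}
pred = {"CNV": [1, 0, 0, 0], "ERM": [1, 1, 1, 0]}

if __name__ == "__main__":
    print(per_class_metrics(truth["CNV"], pred["CNV"]))
    print(micro_macro_f1(truth, pred))
```

In the toy data the class with more positive cases (ERM) is predicted perfectly, so the micro-averaged F1, which pools counts, comes out higher than the macro-averaged F1, which gives the less frequent CNV class the same weight.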
Micro- and macro-averaged scores were then computed for each metric to summarise performance across all classes. Macro-averaged scores weight all classes equally, which ensures that rarer disease classes are adequately represented, while micro-averaged scores pool counts across classes to provide a size-weighted measure of overall performance. Point estimates were accompanied by 95% confidence intervals (CIs) to quantify uncertainty, using the Wilson score method for proportion-based metrics or nonparametric bootstrap sampling (5000 iterations) as appropriate. All analyses were conducted in Python (v3.12.4). The statsmodels and scikit-learn libraries were used to compute classification metrics and agreement, and to quantify uncertainty. Results Dataset characteristics The ‘development’ dataset used for iterative prompt refinement comprised 600 clinic letters from 600 patients (median age 58 years [IQR 54–77]; 55.5% female). There was strong inter-rater agreement between clinical experts using the initial labelling guidelines prior to arbitration and refinement (Cohen’s κ = 0.834). The distribution of ground truth labels is shown in Supplementary Table 3. Iterative prompt refinement and performance evaluation For diagnosis extraction, the iterative prompt engineering process, conducted on the development set using Gemini 1.5 Flash, demonstrated moderately high baseline performance on prompt 1 (task instructions only), with a micro-averaged F1 of 0.850 (95% CI 0.826–0.874). Model performance improved incrementally across the iterative prompt refinement cycle, with the most significant improvements seen following the integration of two key modules into the prompt - reasoning, which instructed the model to cite evidence from the text for each finding before classification (prompt 2), and refinement to the labelling guideline (prompt 5a-b), including revisions to the ‘core principles’ and ‘description of conditions’ sections based on the error analysis detailed in the next subsection. 
Conversely, incorporating more advanced prompt engineering techniques such as few-shot learning (prompt 6) and chain-of-thought prompting (prompt 7) resulted in a slight performance degradation over the clear instructions informed by error analysis that were provided in prompts 5a-b (Fig. 3 A). Per-class analysis showed that performance for most conditions improved sharply from prompt 4 to 5, although there was one condition (ERM) that achieved excellent performance from baseline (Supplementary Fig. 3). The top-performing prompt (prompt 5b) achieved macro- and micro-averaged F1 scores of 0.960 (95% CI 0.936–0.975) and 0.954 (95% CI 0.941–0.967) respectively (sensitivity 0.99 [0.98–1.00], specificity 0.99 [0.99–0.99], PPV 0.92 [0.89–0.94], NPV 1.00 [1.00–1.00]). Seven out of nine classes reached a per-class F1 score exceeding the pre-specified threshold of 0.95. Sensitivities and specificities were high across the board and near-perfect in almost all cases (Supplementary Table 4). The same iterative refinement cycle was applied to two other models in the same family (1.5 Pro and 2.5 Flash), demonstrating a fairly similar trajectory of performance improvement, confirming that the principles of including detailed classification guidelines are generalisable across the same model family. Overall, Gemini 2.5 Flash achieved the strongest performance with a micro-averaged F1 of 0.98 (95% CI 0.97–0.98), sensitivity 0.98 (0.96–0.99), specificity 1.00 (1.00–1.00), PPV 0.97 (0.95–0.98), and NPV 1.00 (1.00–1.00). All nine disease classes exceeded the pre-specified threshold. Laterality extraction achieved strong performance across all prompts tested, with small variations between prompts that did not follow the trajectory for diagnosis extraction (Supplementary Table 5). Gemini 1.5 Flash achieved micro-F1 scores ranging from 0.960–0.975, while Gemini 2.5 Flash achieved 0.979–0.988 across prompts 1–7. 
Generalisability across other models Performance across seven additional LLM families and multiple model sizes (17 local LLMs in total) was evaluated to assess the interoperability of our approach. Overall, the iterative prompt refinement process demonstrated robustness across model families differing in scale and training strategy, although the magnitude of improvement varied between individual models and families. This robustness did not extend to specific models which tended to be smaller (typically < 10b parameters), where performance declined as prompt length and complexity increased. Results for micro-F1 scores for diagnosis extraction are summarised in Table 1 and Fig. 3 B, and all other metrics are presented in Supplementary Table 6.

Table 1 Summary of micro-averaged F1 scores with 95% confidence intervals (CI) for diagnosis for all models, categorised by prompt. All values are micro-averaged F1 scores (95% CI).

Model | Prompt 1 | Prompt 2 | Prompt 3 | Prompt 4 | Prompt 5a | Prompt 5b | Prompt 6 | Prompt 7
cogito:8b | 0.773 (0.74–0.80) | 0.742 (0.71–0.77) | 0.759 (0.73–0.79) | 0.816 (0.79–0.84) | 0.825 (0.80–0.85) | 0.772 (0.74–0.80) | 0.767 (0.74–0.79) | 0.780 (0.75–0.81)
cogito:14b | 0.831 (0.81–0.86) | 0.868 (0.84–0.89) | 0.865 (0.84–0.89) | 0.888 (0.87–0.91) | 0.835 (0.81–0.86) | 0.847 (0.82–0.87) | 0.830 (0.81–0.85) | 0.787 (0.76–0.81)
cogito:32b | 0.882 (0.86–0.90) | 0.859 (0.83–0.88) | 0.897 (0.88–0.92) | 0.891 (0.87–0.91) | 0.924 (0.91–0.94) | 0.930 (0.91–0.94) | 0.927 (0.91–0.94) | 0.881 (0.86–0.90)
deepseek-r1:14b | 0.876 (0.85–0.90) | 0.888 (0.87–0.91) | 0.877 (0.85–0.90) | 0.897 (0.88–0.92) | 0.925 (0.91–0.94) | 0.922 (0.90–0.94) | 0.932 (0.92–0.95) | 0.939 (0.92–0.95)
deepseek-r1:32b | 0.881 (0.86–0.90) | 0.883 (0.86–0.90) | 0.886 (0.86–0.91) | 0.895 (0.88–0.92) | 0.934 (0.92–0.95) | 0.942 (0.93–0.96) | 0.946 (0.93–0.96) | 0.924 (0.91–0.94)
gemini-1.5-flash | 0.850 (0.83–0.87) | 0.873 (0.85–0.90) | 0.872 (0.85–0.89) | 0.897 (0.88–0.92) | 0.920 (0.90–0.94) | 0.954 (0.94–0.97) | 0.949 (0.94–0.96) | 0.918 (0.90–0.93)
gemini-1.5-pro | 0.872 (0.85–0.89) | 0.876 (0.85–0.90) | 0.865 (0.84–0.89) | 0.893 (0.87–0.91) | 0.947 (0.93–0.96) | 0.952 (0.94–0.96) | 0.950 (0.94–0.96) | 0.941 (0.93–0.95)
gemini-2.5-flash | 0.891 (0.87–0.91) | 0.890 (0.87–0.91) | 0.882 (0.86–0.90) | 0.911 (0.89–0.93) | 0.969 (0.96–0.98) | 0.975 (0.96–0.98) | 0.971 (0.96–0.98) | 0.975 (0.96–0.98)
gemini-2.5-flash-updated (October 2025 update) | 0.885 (0.86–0.90) | 0.888 (0.87–0.91) | 0.874 (0.85–0.90) | 0.908 (0.89–0.93) | 0.966 (0.95–0.98) | 0.978 (0.97–0.99) | 0.971 (0.96–0.98) | 0.962 (0.95–0.97)
gemma2:27b | 0.840 (0.81–0.86) | 0.844 (0.82–0.87) | 0.832 (0.81–0.86) | 0.906 (0.89–0.93) | 0.869 (0.85–0.89) | 0.840 (0.82–0.86) | 0.869 (0.85–0.89) | 0.882 (0.86–0.90)
gemma2:2b | 0.606 (0.57–0.64) | 0.772 (0.74–0.80) | 0.781 (0.75–0.81) | 0.762 (0.73–0.79) | 0.564 (0.53–0.60) | 0.641 (0.61–0.67) | 0.564 (0.53–0.60) | 0.679 (0.65–0.71)
gemma2:9b | 0.833 (0.81–0.86) | 0.861 (0.84–0.88) | 0.835 (0.81–0.86) | 0.871 (0.85–0.89) | 0.860 (0.84–0.88) | 0.875 (0.85–0.90) | 0.831 (0.81–0.85) | 0.904 (0.88–0.92)
gpt-oss:20b | 0.902 (0.88–0.92) | 0.906 (0.89–0.92) | 0.907 (0.89–0.93) | 0.924 (0.91–0.94) | 0.945 (0.93–0.96) | 0.951 (0.94–0.96) | 0.938 (0.92–0.95) | 0.958 (0.94–0.97)
llama3.1:8b | 0.742 (0.71–0.77) | 0.697 (0.67–0.73) | 0.700 (0.67–0.73) | 0.759 (0.73–0.79) | 0.696 (0.67–0.72) | 0.480 (0.45–0.51) | 0.467 (0.44–0.49) | 0.495 (0.47–0.52)
llama3.3:70b | 0.884 (0.86–0.90) | 0.886 (0.87–0.91) | 0.874 (0.85–0.90) | 0.907 (0.89–0.93) | 0.912 (0.89–0.93) | 0.906 (0.89–0.92) | 0.907 (0.89–0.92) | 0.877 (0.86–0.90)
medgemma:27b | 0.870 (0.85–0.89) | 0.869 (0.85–0.89) | 0.864 (0.84–0.89) | 0.898 (0.88–0.92) | 0.913 (0.90–0.93) | 0.911 (0.89–0.93) | 0.886 (0.86–0.91) | 0.933 (0.92–0.95)
mistral-small3.2:24b | 0.877 (0.85–0.90) | 0.874 (0.85–0.90) | 0.858 (0.83–0.88) | 0.891 (0.87–0.91) | 0.926 (0.91–0.94) | 0.937 (0.92–0.95) | 0.932 (0.92–0.95) | 0.921 (0.90–0.94)
mistral-small:22b | 0.805 (0.78–0.83) | 0.821 (0.80–0.84) | 0.846 (0.82–0.87) | 0.899 (0.88–0.92) | 0.890 (0.87–0.91) | 0.811 (0.79–0.84) | 0.817 (0.79–0.84) | 0.811 (0.79–0.83)
mixtral:8x7b | 0.830 (0.81–0.85) | 0.815 (0.79–0.84) | 0.837 (0.81–0.86) | 0.894 (0.88–0.91) | 0.886 (0.86–0.91) | 0.838 (0.81–0.86) | 0.839 (0.81–0.86) | 0.869 (0.85–0.89)
phi3:14b | 0.727 (0.70–0.76) | 0.767 (0.74–0.79) | 0.816 (0.79–0.84) | 0.819 (0.79–0.84) | 0.742 (0.71–0.77) | 0.688 (0.66–0.72) | 0.673 (0.64–0.70) | 0.638 (0.61–0.67)
phi4:14b | 0.890 (0.87–0.91) | 0.901 (0.88–0.92) | 0.890 (0.87–0.91) | 0.907 (0.89–0.93) | 0.914 (0.90–0.93) | 0.922 (0.90–0.94) | 0.932 (0.92–0.95) | 0.923 (0.91–0.94)

The top performing local models were: GPT-oss 20b (micro F1 0.96 [95% CI 0.94–0.97]; sensitivity 0.97 [0.95–0.98]; specificity 0.99 [0.99–1.00], PPV 0.95 [0.92–0.96], NPV 1.00 [1.00–1.00]); and Phi-4 14b (micro F1 0.93 [0.92–0.95], sensitivity 0.97 [0.95–0.98]; specificity 0.99 [0.99–0.99], PPV 0.90 [0.87–0.92], NPV 1.00 [1.00–1.00]). Overall, baseline performance in prompt 1 was moderately high across models (micro-F1 > 0.8 for models > 10b in size), and this tended to improve from prompts 1–4. Performance tended to improve from prompt 4 to 5 (introduction of refined explanations following error analysis for the development model) in the majority of models, whereas smaller models from specific LLM families (Gemma-2 2b, Llama-3.1 8b, Mistral Small 22b, Mixtral 8x7b) saw a sharp performance drop. Two advanced prompt engineering techniques (ICL and CoT) were introduced in prompts 6 and 7 respectively, building on the error analysis from prompts 5a-b. For a subset of models, there was a small performance improvement with ICL (Deepseek-R1 32b, Phi-4 14b) and CoT (Deepseek-R1 14b, Gemma-2 9b, GPT-oss 20b, Medgemma 27b) over the error analyses in prompts 5a-b. Prompts 5a and 5b differ only in the position of the clinical letter relative to the remainder of the prompt. However, this resulted in performance differences across all models. There was a meaningful performance improvement in the Gemini family models from 5a to 5b (e.g. micro-F1 0.920–0.954 in Gemini 1.5 Flash, and 0.969–0.975 in Gemini 2.5 Flash). This was also true for GPT-oss and the Gemma models, while the converse was true for the Llama models. 
Results were mixed for the Cogito, Mistral, Deepseek and Phi families. Laterality extraction achieved strong performance across the additional LLM families. Exceptions included certain smaller models (Cogito 8b, Gemma-2 2b, Phi-3 14b) which achieved lower micro-F1 scores (< 0.90) across the board and which saw a small but perceptible performance decline from prompts 5a-7 (Supplementary Table 5). Analysis of performance-cost trade-offs Figure 4 A-B shows the performance-cost and performance-time trade-offs (the “cost frontier” and “latency frontier”) across multiple LLM families and model sizes on Prompt 5b (highest performance prompt). The Pareto frontier was dominated by the Gemini family on both counts. When limited to local open-weight models only, the Pareto frontier was formed by gemma2-2b on the lower end and gpt-oss-20b on the higher end. Gemini 1.5 Flash was the most cost- and time-efficient model overall (Fig. 5 ). Results across all prompts are displayed in Supplementary Figs. 5–7. Overall, when analysed across all eight prompts, the Pareto frontier for all models was formed by the Gemini models on the lower end and two local models (phi4-14b and gpt-oss-20b) on the higher end from prompts 1–4. Performance improvement diverged from prompt 5a onwards, with larger improvements seen in the Gemini models and smaller improvements in the two local models, meaning that these local models no longer formed part of the overall Pareto frontier. Error taxonomy The error taxonomy and representative examples for each category are presented in Table 2 . Two key classes emerged from the analysis, which are described in brief below: Table 2 Taxonomy of errors, categorised as model-centric and data-centric errors. The table outlines the classification framework used for the analysis, detailing the eight identified error categories with definitions and illustrative examples from the clinical letters. 

| Category | Theme | Examples |
|---|---|---|
| Model-centric errors | Theme 1: Deficiencies in domain knowledge - specifically ophthalmic knowledge such as subtypes, anatomy | E.g. the classification task for ‘CNV’ required an understanding that conditions such as idiopathic polypoidal choroidal vasculopathy and retinal angiomatous proliferation are subtypes of CNV. |
| Model-centric errors | Theme 2: Errors in clinical inference (following correct identification of diagnoses or findings) | E.g. correctly identifying the presence of intraretinal and/or subretinal fluid in the letter, but inferring this to represent ‘macular oedema’ (swelling of the macula due to fluid build-up). In reality, macular oedema is a specific description for leakage-driven fluid build-up, whereas subretinal fluid is not specific to macular oedema, and can be seen in the context of multiple diseases including CSCR. |
| Model-centric errors | Theme 3: Linguistic errors, specifically coordination and reference errors, can occur when a statement lists multiple items, to which the model incorrectly links a modifier due to a failure to correctly parse complex grammatical structures. | E.g. parsing the sentence “He was referred for a macular hole in his right eye and an epiretinal membrane in his left eye. The former was not evident and the latter was mild”, requires a multi-step logical inference that extends beyond simple pattern matching. |
| Model-centric errors | Theme 4: Misinterpretation of context. | E.g. the model fails to recognise that a referral reason (e.g. “Mr X was referred by his opticians for a macular hole”) does not constitute a current and definitive diagnosis, particularly as subsequent findings from a clinical examination, objective test, or the final impression may not support this. |
| Model-centric errors | Theme 5: Hallucinations, where the model fabricates information (such as diagnoses or findings) that is not present in the text | |
| Data-centric errors | Theme 6: Conflicting information was occasionally present in the source text. | Transcription errors, e.g. where the problem list (which may be automatically copied over from a previous clinical encounter) might state “mild non-proliferative diabetic retinopathy with macular oedema”, whereas the description of a test such as an optical coherence tomography (OCT) scan later in the same letter might report “macula dry, exudates only”, which would rule out the presence of macular oedema and should carry a higher epistemic weight due to its objective nature. Temporality, e.g. where the problem list might state ‘full-thickness macular hole’ and be followed by a further surgical procedure (e.g. ‘PPV/ ILM peel/ gas’) to treat the condition, and would typically imply surgical success unless specified otherwise. This would therefore mean that the condition was no longer present at the time of the clinical visit. |
| Data-centric errors | Theme 7: Clinical uncertainty and ambiguity or hedging in the source text | Some clinicians used phrases such as “suspicious for,” “features suggestive of,” “possible early…”, or “query [condition]” to propose a possible diagnosis which would require further monitoring or investigation, while clearly favouring a specific diagnosis in the text. In addition, a clinical letter might list several potential differential diagnoses to explain the findings (e.g. “subretinal fluid could represent either CSCR or early CNV”). |

Firstly, predominantly model-centric errors, which reflect deficiencies in the LLM’s domain knowledge (Theme 1, e.g. not knowing that idiopathic polypoidal choroidal vasculopathy is a subtype of CNV); errors in reasoning for clinical inference (Theme 2, e.g. incorrectly inferring that subretinal fluid meant macular oedema); failure to parse linguistic structures (Theme 3, e.g. incorrectly linking modifiers in coordinated sentences); misinterpretation of context (Theme 4, e.g. mistaking a referral reason for a definitive diagnosis); or “hallucinations”, where models confabulate or fabricate non-existent information such as diagnoses and findings (Theme 5). 
Secondly, data-centric errors, which arise from the inherent complexities and ambiguities of the clinical letters themselves. These include conflicting information arising from competing diagnoses (most often due to transcription errors) or temporal differences where a condition had been successfully treated but both the original diagnosis and the treatment were included (Theme 6), as well as clinical ambiguity or uncertainty, such as the use of hedging phrases like “suspicious for” or listing multiple differential diagnoses while clearly favouring a specific diagnosis in the text (Theme 7). Informed by these key themes, the prompt was refined to mitigate errors identified in the error analysis (Prompts 5a-b). A detailed description of each condition and its clinical meaning was provided to address Themes 1 and 2. To address Theme 6, an evidence hierarchy was added to the core principles, prioritising objective investigations such as OCT scans, followed by examination findings, then by the problem list. Instructions to disregard referral reasons were included given that discrepancies between referral reasons and the final specialist diagnosis were not uncommon. The instructions on status were revised for clarity to address the errors in Theme 7. Error analysis A random sample of 25 letters (representing 512 false positives and false negatives across all models and prompts) was drawn. These errors were classified independently by an ophthalmologist (AYO) and an “LLM judge” according to the error taxonomy. Substantial agreement was achieved between the human and LLM judge (Cohen’s κ 0.909) (Supplementary Fig. 8). The optimised prompt was then applied to the complete dataset of model-generated errors to quantify their distribution and observe patterns and trends across models and prompts. The distribution of error themes by model and prompt is presented in Supplementary Figs. 9–10. 
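Agreement between the ophthalmologist and the LLM judge was summarised above with Cohen's κ. As a minimal sketch of how that statistic is computed when two raters each assign one error theme per error, the snippet below uses only the Python standard library. The function name `cohens_kappa` and the theme labels in the example are illustrative assumptions, not the study's code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who each assign one label per item."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both raters must label the same non-empty set of items")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels used by either rater.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    all_labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in all_labels) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; agreement is total.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical theme assignments for six errors (not study data).
human = ["T1", "T1", "T2", "T6", "T6", "T7"]
llm_judge = ["T1", "T1", "T2", "T6", "T2", "T7"]
print(round(cohens_kappa(human, llm_judge), 3))  # 0.778
```

Because κ corrects for chance agreement, it is lower than the raw agreement rate (5/6 in this toy example) whenever the raters' label distributions overlap.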
The most common errors were errors in domain knowledge (theme 1) and clinical inference (theme 2) as well as handling conflicting information within the clinical text (theme 6). These errors were mostly addressed with more directed prompting but could not be completely eliminated. While prompts 5a-5b were designed to address the most common errors in the development model (Gemini 1.5 Flash), the benefits of this prompting strategy extended across other models/ model families. However, this pattern did not extend to smaller models (< 10b parameters) - error rates (particularly hallucinations) increased with length and depth of detail provided in the prompt. Hallucinations were far rarer in large local and proprietary models. Generalisability in external validation To validate the final pipeline, the top-performing prompt (prompt 5b) was applied to an unseen dataset of 300 letters. The pipeline demonstrated strong generalisability for diagnosis detection across all 9 classes in the same LLM family: Gemini 1.5 Flash (micro-averaged F1 0.945, 95% CI 0.920–0.966), Gemini 1.5 Pro (0.960, 95% CI 0.939–0.978), and Gemini 2.5 Flash (0.980, 95% CI 0.964–0.993). The same was true for laterality: Gemini 1.5 Flash (0.941, 95% CI 0.904–0.973), Gemini 1.5 Pro (0.912, 95% CI 0.870–0.948), and Gemini 2.5 Flash (0.974, 95% CI 0.948–0.995). Gemini 2.5 Flash was selected as the optimal configuration for scaling up the pipeline to the full dataset because of the optimal balance of performance, cost, and efficiency identified from the above experiments: micro-F1 0.975 (95% CI 0.965–0.984); USD 0.00199/ letter; 7.8s (IQR 6.81–9.20) per letter. Discussion In this study, we developed a structured framework for clinical information extraction from real-world ophthalmic letters, which combines iterative prompt refinement, rigorous error analysis, and operational benchmarking with the aim of providing generalisable insights. 
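The headline metric reported throughout is the micro-averaged F1, which pools true positives, false positives and false negatives across all conditions before computing a single score, so frequent conditions carry more weight than rare ones. The sketch below illustrates this for a multi-label task. The function names, the three-condition label set and the toy letters are assumptions for illustration; they do not reproduce the study's evaluation code or its confidence interval method.

```python
def pooled_counts(gold, predicted, conditions):
    """Pool TP/FP/FN/TN over every (letter, condition) pair."""
    tp = fp = fn = tn = 0
    for gold_set, pred_set in zip(gold, predicted):
        for condition in conditions:
            in_gold = condition in gold_set
            in_pred = condition in pred_set
            if in_gold and in_pred:
                tp += 1
            elif in_pred:
                fp += 1
            elif in_gold:
                fn += 1
            else:
                tn += 1
    return tp, fp, fn, tn


def micro_metrics(tp, fp, fn, tn):
    """Micro-averaged metrics from pooled counts (NaN if a denominator is 0)."""
    def ratio(num, den):
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "micro_f1": ratio(2 * tp, 2 * tp + fp + fn),
    }


# Hypothetical label set and four toy letters (not study data).
CONDITIONS = ["CNV", "CSCR", "ERM"]
gold = [{"CNV"}, {"ERM"}, set(), {"CSCR", "ERM"}]
predicted = [{"CNV"}, {"ERM", "CSCR"}, set(), {"ERM"}]

counts = pooled_counts(gold, predicted, CONDITIONS)
print(counts)                             # (3, 1, 1, 7)
print(micro_metrics(*counts)["micro_f1"])  # 0.75
```

Note that true negatives dominate the pooled counts when most letters mention only one or two of the conditions, which is why specificity and NPV sit close to 1.00 even when PPV is noticeably lower.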
We demonstrate that LLMs can achieve a pre-specified high-performance threshold for extracting ophthalmic diagnoses and laterality from unstructured clinical letters without the need for elaborate feature engineering. Using this approach, Gemini models achieved our pre-specified target of micro-averaged F1 ≥ 0.95 from Prompt 5b onwards, and a local open-weight model (gpt-oss-20b) reached comparable performance from Prompt 5a onwards. Our work was grounded in a pragmatic examination of the trade-off between model performance and costs to develop a scalable and resource-efficient pipeline. Further contributions include an error taxonomy for data extraction from clinical letters. Prompt engineering techniques and iterative prompt refinement Information extraction is a classic NLP task which has traditionally relied on rule-based systems or fine-tuned encoder models such as BERT and its derivatives. 29 A recent scoping review of LLM-based approaches to information extraction from radiology reports found that the majority (28/34, 82%) used BERT-based models, 30 although there is increasing interest in employing generative LLMs for clinical information extraction. While some studies report that LLM performance improves with fine-tuning, 31,32 recent work suggests that prompt engineering alone may suffice in this regard. 12–15 However, testing on a realistic spectrum of real-world data is essential, as small datasets or synthetic data limit clinical applicability. 33 A key insight from our work is that familiarity with the behaviour of the model or model family is important. While the methodology applied the same broad prompt refinement strategy across different model architectures, different models possessed distinct sensitivities, such as tolerance to prompt length, susceptibility to “context rot”, and response to text positioning. 
For example, we showed that performance in models from the Gemini family improved when the clinical letter was introduced immediately after the role and task in the prompt, whereas this was the opposite for the Llama and Mistral models. This is analogous to previous work exploring model context length, which found that specific LLMs performed better at extracting data from the very beginning (primacy bias) and very end (recency bias) of their input prompt context windows, 34 and extends work done on short single sentence prompts which found that prompt position does matter. 35 Awareness of these distinctions is essential when designing and optimising prompts for specific models and tasks. In addition, advanced prompt engineering techniques did not always improve model performance. Some models (e.g. Deepseek-R1 32b, GPT-oss 20b, Phi-4 14b) showed improvement with few-shot learning, though others (e.g. Gemini 1.5 Flash, 1.5 Pro, 2.5 Flash) demonstrated slight performance deterioration. This heterogeneity reinforces the importance of empirical prompt testing, even within the same model family, rather than assuming more elaborate prompting yields better results. The nature of real-world clinical texts and the need for error analysis Overall, our findings support the use of iterative prompt refinement as a systematic and reproducible approach for improving model reliability and interpretability. Beyond improving model performance, this can also serve as a diagnostic tool for understanding model behaviour and generalisation. However, as prompts interact closely with the data they are applied to, model performance ultimately depends on the irregularities, ambiguities, and inconsistencies inherent in real-world clinical text. Error analysis showed that model performance was influenced by both inherent model limitations (“model-centric errors”) as well as the complexity of the data itself (“data-centric errors”). 
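The letter-position effect discussed above concerns two prompts with identical content that differ only in where the clinical letter sits. A minimal sketch of how such paired variants might be generated is shown below, so that any performance difference can be attributed to position alone. The function `build_prompt`, the section header and all placeholder strings are hypothetical and are not the study's actual prompts; which layout corresponds to Prompt 5a or 5b is deliberately left unspecified here.

```python
def build_prompt(role_and_task, instructions, letter, letter_first):
    """Assemble a prompt with the letter either right after the role/task
    (letter_first=True) or after the detailed instructions (False)."""
    letter_block = "CLINICAL LETTER:\n" + letter.strip()
    parts = [role_and_task.strip()]
    if letter_first:
        parts += [letter_block, instructions.strip()]
    else:
        parts += [instructions.strip(), letter_block]
    return "\n\n".join(parts)


# Hypothetical placeholder text (not the study's prompt wording).
ROLE_AND_TASK = "You are an ophthalmologist extracting diagnoses from a letter."
INSTRUCTIONS = "Follow the condition definitions and the evidence hierarchy."
LETTER = "Dear Dr X, OCT today shows a dry macula in both eyes."

early = build_prompt(ROLE_AND_TASK, INSTRUCTIONS, LETTER, letter_first=True)
late = build_prompt(ROLE_AND_TASK, INSTRUCTIONS, LETTER, letter_first=False)

# Both variants contain exactly the same text blocks in a different order.
print(sorted(early.split("\n\n")) == sorted(late.split("\n\n")))  # True
```

Holding the text constant and varying only the order keeps the comparison controlled, which matters given that the direction of the effect differed between model families.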
The recurring failure modes identified included gaps in domain knowledge and errors in clinical inference, contextual misattribution, and lexical ambiguity. These findings suggest that LLMs are not only constrained by their training data but also by the inherent irregularity and contextual richness of real-world clinical documentation itself. These often contain idiosyncratic shorthand, inconsistent formatting and abbreviations, and implicit reasoning that challenge deterministic parsing, 36 highlighting the need for model testing on a range of real-world data reflective of clinical realities beyond highly-curated benchmark datasets 37 and reinforcing the importance of systematic error analysis. For example, practicing clinicians will recognise that discrepancies frequently exist between the referrer’s diagnosis and the specialist’s final diagnosis. 38 , 39 A recent ophthalmology audit found that the referral reasons for OCT abnormalities provided by primary eye care professionals differed from the final diagnosis made by a consultant ophthalmologist in 61.2% of cases. 38 In addition, conflicting information can frequently arise from transcription or documentation errors in busy high-throughput clinical environments. This is often less consequential for clinicians, who can resolve these conflicts through contextual knowledge and understanding, but poses a significant challenge for LLMs. Our systematic error analysis subsequently allowed us to articulate an 'evidence hierarchy' that mirrors the clinician's thinking process for conflict resolution. In this study, quantifying and qualifying these errors through a structured error taxonomy facilitated targeted remediation via iterative prompt refinement. Recognising and systematising these error patterns moves evaluation beyond global performance metrics toward interpretability - understanding why errors occur and where they matter most. 
This approach supports safer deployment by focusing attention on high-risk failure modes and prioritising quality improvement at the interface between language, context, and clinical meaning. However, there is a small proportion of errors that cannot be addressed through prompt refinement alone. Addressing epistemic uncertainty from the model’s intrinsic knowledge gaps may in some cases require fine-tuning or domain adaptation, although this must be weighed against the operational simplicity and generalisability of prompt-based optimisation. Implications for real-world deployment LLMs’ information extraction capabilities can be leveraged to detect adverse events and medication-related outcomes from EHRs, which holds significant potential for enhancing downstream pharmacovigilance and pharmaco-epidemiology tasks to support postmarket surveillance of medical products. 41 In this study, we extend this proposal to show how LLMs could be used to support registry curation, large-scale audit, or algorithmovigilance 42 (monitoring of AI embedded in healthcare systems for clinical care post-deployment), which could take the form of institutional dashboards overseen by AI safety committees to ensure oversight aligns with regulatory obligations and clinical accountability. From our experience, there are several practical challenges that need to be addressed in order to fully realise this potential. While high performance is a fundamental requirement, the operational costs and time required to achieve this are perhaps equally important for large-scale deployment. The Pareto frontier is a concept adopted from economics, 43 which takes into account the trade-offs between two objectives (e.g. cost and accuracy) to identify the set of optimal solutions for a multi-objective optimisation problem. It provides a pragmatic framework for examining these trade-offs, identifying models that deliver near-optimal performance relative to their computational and financial burden. 
Application to medical AI tasks has been limited, although a recent study comparing LLM performance in medical question answering has employed this to identify Pareto efficient configurations. 44 In this study, we extend its application to the more granular and complex task of multi-class clinical data extraction from real-world clinical letters, in addition to attempting to synthesise the trade-offs between three factors that are critical to LLM deployment on real-world clinical data (performance vs cost and latency), rather than the standard two. Balancing these factors is a strategic decision shaped by institutional priorities, financial resources, and infrastructural constraints. The Pareto frontier analysis was helpful here in determining the optimal model to take forward for deployment at scale, as the capacity to process thousands of records per hour may sometimes be more valuable than marginal improvements in accuracy, for example. Further practical considerations include institutional policies, which often determine whether and which proprietary models can be used, while computational infrastructure may limit the feasibility of large locally-deployed LLMs, favouring smaller alternatives which may have performance limitations depending on the specific task. Furthermore, reliance on proprietary frontier LLMs introduces risk of performance or algorithmic drift due to unannounced model updates, 45,46 which led us to test the performance and the interoperability of our approach on both frontier and local models available to us. In addition, data drift over time can arise from evolving documentation or coding conventions, which can degrade performance. Overall, these factors underscore the need for continuous monitoring and error analysis within a structured surveillance framework to support safe deployment at scale. 
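To make the Pareto frontier concept used above concrete, the sketch below identifies the non-dominated models for a single trade-off (cost per letter against micro-F1): a model is kept unless another model is at least as cheap and at least as accurate, and strictly better on one of the two. The function name `pareto_frontier`, the model names and all numbers are made-up placeholders, not results from this study; the same function could be applied to latency by passing time per letter in place of cost.

```python
def pareto_frontier(models):
    """Return non-dominated (name, cost, score) tuples, sorted by cost.

    Lower cost is better; higher score is better.
    """
    frontier = []
    for name, cost, score in models:
        dominated = any(
            other_cost <= cost
            and other_score >= score
            and (other_cost < cost or other_score > score)
            for _, other_cost, other_score in models
        )
        if not dominated:
            frontier.append((name, cost, score))
    return sorted(frontier, key=lambda m: m[1])


# Hypothetical (name, USD per letter, micro-F1) values for illustration only.
candidates = [
    ("model_a", 0.0005, 0.90),
    ("model_b", 0.0010, 0.95),
    ("model_c", 0.0020, 0.94),  # dominated by model_b: dearer and less accurate
    ("model_d", 0.0030, 0.97),
]

for name, cost, score in pareto_frontier(candidates):
    print(name, cost, score)
# model_a 0.0005 0.9
# model_b 0.001 0.95
# model_d 0.003 0.97
```

The frontier does not pick a single winner; it only removes options that are never worth choosing. Selecting among the remaining models is the institutional decision described above, weighing budget and throughput against marginal gains in accuracy.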
Future work Future work will focus on operationalising this methodology into a clinician-friendly code-free software platform that can be seamlessly integrated into real-world clinical and research workflows. Although our results demonstrated strong performance across multiple models and model families, the probabilistic nature of LLMs may necessitate additional safeguards for large-scale deployment. This may include developing uncertainty metrics with human-in-the-loop review of cases flagged as low confidence. For fully automated applications, scalable audit mechanisms such as random or stratified sampling over time or subgroup analyses will be essential for monitoring the data extraction pipeline prospectively to detect drift and maintain reliability. Finally, extending beyond classification toward richer clinical concept extraction and normalisation (e.g. mapping clinician-described conditions to standard ontologies) could broaden downstream utility while preserving interpretability. Balancing this increased semantic depth with the complexity of evaluation as well as computational efficiency for operationalisation will support the next phase of real-world implementation. Strengths and limitations Strengths of this study include the rigorous and clinically-grounded design led by domain experts familiar with the nuances of ophthalmic documentation and the operational realities of clinical workflows. Dataset creation was deliberately structured to balance real-world sampling with enrichment to minimise the impact of class imbalance, supported by gold standard labelling and external validation to confirm generalisability. Secondly, our evaluation framework was designed to identify the model of greatest utility and to move beyond standard accuracy metrics to assess operational efficiency by employing a Pareto analysis to formally identify models with the optimal value. 
Finally, we have conducted a qualitative error analysis and developed an error taxonomy for ophthalmic clinical letters, which facilitated the identification of model failure modes and the evaluation of how inherent ambiguities in clinical documentation drive these failures - insights which are transferable to other clinical data extraction domains. In terms of limitations, we focused on a single language (English) as this is the predominant language in the UK. Despite external validation across subspecialties and time, our findings are based on clinical letters from a single institution. However, as Moorfields is the largest eye hospital in the UK, comprising 27 networked sites, the dataset captures a significant diversity of patient populations and clinicians, including writing styles. This inherent heterogeneity provides a degree of robustness and makes our findings more representative than typical single-site studies, although generalisability to other countries and healthcare systems should be tested as well. Nevertheless, the conceptual framework used to develop the pipeline should be broadly applicable. In addition, while our use of a standardised prompt was necessary for a fair comparison, this may not reflect the peak performance that each model could achieve with a tailored prompt. However, it provides a more pragmatic measure of a model's “out-of-the-box” usability and the portability of our prompt engineering, a critical factor for real-world deployment given the unfeasibility of designing and maintaining numerous bespoke prompts. Finally, given the rapid pace of development in LLMs, our performance ratings should be considered a snapshot in time, although we believe the framework itself has enduring value. 
Conclusion We have demonstrated that a structured and iterative approach to prompt refinement can be used to efficiently leverage LLMs for real-world clinical information extraction at scale, thereby transforming unstructured text into structured high-fidelity data for downstream tasks. Through employing systematic error characterisation and Pareto frontier analyses for cost and latency, we reframe the evaluation of LLMs from a narrow focus on performance to a broader operational perspective - essential considerations that determine whether these systems can be scaled safely and in a resource-efficient manner. This framework may help lay the foundation for next-generation data pipelines that can accelerate scientific discovery and power continuous learning health systems. Abbreviations AI, artificial intelligence CSCR, central serous chorioretinopathy CNV, choroidal neovascularisation EHR, electronic health records ERM, epiretinal membrane FTMH, full-thickness macular hole GA, geographic atrophy LLM, large language model NLP, natural language processing OCT, optical coherence tomography VMT, vitreomacular traction Declarations This study was approved by the Moorfields Audit committee. Author contributions AYO conceptualised and coordinated the study, and performed data acquisition (together with IB), analysis and interpretation. AYO and QNN developed and executed the programming code, with QNN providing technical direction and implementation support. AYO prepared the first draft of the manuscript, which was critically reviewed and revised by all authors (AYO, QN, IB, JE, FA, MS, DAM, LJ, ED, YZ, GM, YT, AKD, PAK), who have read and approved the final manuscript. Acknowledgements AYO is supported by a National Institute for Health Research (NIHR) - Moorfields Eye Charity (MEC) Doctoral Fellowship (NIHR303691). 
PAK is supported by a UK Research & Innovation Future Leaders Fellowship (MR/T019050/1), Moorfields Eye Charity with The Rubin Foundation Charitable Trust (GR001753), and an Alcon Research Institute Senior Investigator Award. The views expressed in this publication are those of the authors and not necessarily those of the abovementioned funding bodies. We also thank Dr Siegfried K Wagner for his comments on a previous version of the manuscript. Competing interests FA is an equity owner in SIMA Surgical Intelligence Inc. PAK is a cofounder of Cascader Ltd. and has acted as a consultant for Retina Consultants of America, Roche, Boehringer Ingelheim, and Bitfount, and is an equity owner in Big Picture Medical. He has received speaker fees from Zeiss, Thea, Apellis, and Roche. He has received travel support from Bayer and Roche. He has attended advisory boards for Topcon, Bayer, Boehringer Ingelheim, and Roche. None of the other authors report any conflicts of interest. Code availability The code used in this study is not publicly available. It was developed for use within a secure clinical computing environment and contains components specific to local data structures and governance requirements. The code may be made available for academic review upon reasonable request, subject to institutional approval. Data availability The source data consist of routinely collected clinical letters containing sensitive patient information and cannot be shared publicly due to information governance constraints. All summary data, aggregated results, and analyses required to interpret the findings are provided in the manuscript and supplementary materials. References Jamieson T, Ailon J, Chien V, Mourad O. An electronic documentation system improves the quality of admission notes: a randomized trial. J Am Med Inform Assoc . 2017;24(1):123–129. doi: 10.1093/jamia/ocw064 Amarasingham R, Plantinga L, Diener-West M, Gaskin DJ, Powe NR. 
Clinical information technologies and inpatient outcomes: a multiple hospital study. Arch Intern Med . 2009;169(2):108–114. doi: 10.1001/archinternmed.2008.520 Holmes JH, Beinlich J, Boland MR, et al. Why Is the Electronic Health Record So Challenging for Research and Clinical Care? Methods Inf Med . 2021;60(1–02):32–48. doi: 10.1055/s-0041-1731784 Cowie MR, Blomster JI, Curtis LH, et al. Electronic health records to facilitate clinical research. Clin Res Cardiol . 2017;106(1):1–9. doi: 10.1007/s00392-016-1025-6 Botsis T, Hartvigsen G, Chen F, Weng C. Secondary Use of EHR: Data Quality Issues and Informatics Opportunities. Summit Transl Bioinform . 2010;2010:1–5. Kong HJ. Managing Unstructured Big Data in Healthcare System. Healthc Inform Res . 2019;25(1):1–2. doi: 10.4258/hir.2019.25.1.1 Wu H, Wang M, Wu J, et al. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. npj Digit Med . 2022;5(1):1–15. doi: 10.1038/s41746-022-00730-6 Fu S, Chen D, He H, et al. Clinical concept extraction: A methodology review. Journal of Biomedical Informatics . 2020;109:103526. doi: 10.1016/j.jbi.2020.103526 Bazoge A, Morin E, Daille B, Gourraud PA. Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review. JMIR Medical Informatics . 2023;11(1):e42477. doi: 10.2196/42477 Naveed H, Khan AU, Qiu S, et al. A Comprehensive Overview of Large Language Models. arXiv . Preprint posted online October 17, 2024:arXiv:2307.06435. doi: 10.48550/arXiv.2307.06435 Ayhan MS, Ong AY, Ruffell E, Wagner SK, Merle DA, Keane PA. In-context learning for data-efficient classification of diabetic retinopathy with multimodal foundation models. medRxiv . Preprint posted online March 10, 2025: 2025.03.09.25323618 . doi:10.1101/2025.03.09.25323618 Hein D, Christie A, Holcomb M, et al. Iterative refinement and goal articulation to optimize large language models for clinical information extraction. npj Digit Med . 2025;8(1):301. 
doi: 10.1038/s41746-025-01686-z Wihl J, Rosenkranz E, Schramm S, et al. Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines. European Radiology Experimental . 2025;9(1):61. doi: 10.1186/s41747-025-00600-2 Wiest IC, Ferber D, Zhu J, et al. Privacy-preserving large language models for structured medical information retrieval. npj Digit Med . 2024;7(1):257. doi: 10.1038/s41746-024-01233-2 Wiest IC, Verhees FG, Ferber D, et al. Detection of suicidality from medical text using privacy-preserving large language models. Br J Psychiatry . 225(6):532–537. doi: 10.1192/bjp.2024.134 NHS Digital. Hospital Outpatient Activity 2019-20. NHS Digital. 2020. Accessed May 27, 2021. https://digital.nhs.uk/data-and-information/publications/statistical/hospital-outpatient-activity/2019-20/summary-report---treatment-specialities Radell JE, Tatum JN, Lin CT, et al. Risks and rewards of increasing patient access to medical records in clinical ophthalmology using OpenNotes. Eye . 2022;36(10):1951–1958. doi: 10.1038/s41433-021-01775-9 De Fauw J, Ledsam JR, Romera-Paredes B, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. Nat Med . 2018;24(9):9. doi: 10.1038/s41591-018-0107-6 Kraljevic Z, Shek A, Yeung JA, et al. Validating Transformers for Redaction of Text from Electronic Health Records in Real-World Healthcare. In: 2023 IEEE 11th International Conference on Healthcare Informatics (ICHI) . 2023:544–549. doi: 10.1109/ICHI57859.2023.00098 Information Commissioner’s Office. What is personal data? November 27, 2024. Accessed December 19, 2024. https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/ Robertson S, Zaragoza H. The Probabilistic Relevance Framework: BM25 and Beyond. INR . 2009;3(4):333–389. doi: 10.1561/1500000019 Jin Q, Kim W, Chen Q, et al. 
MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics . 2023;39(11):btad651. doi: 10.1093/bioinformatics/btad651 Cormack GV, Clarke CLA, Buettcher S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval . Published online July 19, 2009:758–759. doi: 10.1145/1571941.1572114 Braun V, Clarke V. Thematic analysis. In: APA Handbook of Research Methods in Psychology, Vol 2: Research Designs: Quantitative, Qualitative, Neuropsychological, and Biological . APA handbooks in psychology®. American Psychological Association; 2012:57–71. doi: 10.1037/13620-004 Continuing to bring you our latest models, with an improved Gemini 2.5 Flash and Flash-Lite release- Google Developers Blog. Accessed December 7, 2025. https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/ ollama/ollama. Published online October 29, 2025. Accessed October 30, 2025. https://github.com/ollama/ollama Dettmers T, Zettlemoyer L. The case for 4-bit precision: k-bit inference scaling laws. In: Proceedings of the 40th International Conference on Machine Learning . Vol 202. ICML’23. JMLR.org; 2023:7750–7774. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) . 2012;22(3):276–282. Singh S. Natural Language Processing for Information Extraction. arXiv . Preprint posted online July 6, 2018:arXiv:1807.02383. doi: 10.48550/arXiv.1807.02383 Reichenpfader D, Müller H, Denecke K. A scoping review of large language model based approaches for information extraction from radiology reports. npj Digit Med . 2024;7(1):1–12. doi: 10.1038/s41746-024-01219-0 Losch N, Plagwitz L, Büscher A, Varghese J. 
Fine-Tuning LLMs on Small Medical Datasets: Text Classification and Normalization Effectiveness on Cardiology reports and Discharge records. arXiv . Preprint posted online March 27, 2025:arXiv:2503.21349. doi: 10.48550/arXiv.2503.21349 Akbasli IT, Birbilen AZ, Teksam O. Leveraging large language models to mimic domain expert labeling in unstructured text-based electronic healthcare records in non-english languages. BMC Med Inform Decis Mak . 2025;25(1):154. doi: 10.1186/s12911-025-02871-6 Ntinopoulos V, Rodriguez Cetina Biefer H, Tudorache I, et al. Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation. BMJ Health Care Inform . 2025;32(1):e101139. doi: 10.1136/bmjhci-2024-101139 Liu NF, Lin K, Hewitt J, et al. Lost in the Middle: How Language Models Use Long Contexts. arXiv . Preprint posted online November 20, 2023:arXiv:2307.03172. doi: 10.48550/arXiv.2307.03172 Mao J, Middleton SE, Niranjan M. Do prompt positions really matter? arXiv . Preprint posted online June 28, 2024:arXiv:2305.14493. doi: 10.48550/arXiv.2305.14493 Long WJ. Parsing Free Text Nursing Notes. AMIA Annu Symp Proc . 2003;2003:917. Panch T, Pollard TJ, Mattie H, Lindemer E, Keane PA, Celi LA. “Yes, but will it work for my patients?” Driving clinically relevant research with benchmark datasets. npj Digit Med . 2020;3(1):87. doi: 10.1038/s41746-020-0295-6 Ong AY, Naughton A, Hornby S, Shwe-Tin A. Impact of an email advice service on filtering and refining ophthalmology referrals in England. Int Ophthalmol . 2023;43(11):4019–4025. doi: 10.1007/s10792-023-02806-y Stunkel L, Sharma RA, Mackay DD, et al. Patient Harm Due to Diagnostic Error of Neuro-Ophthalmologic Conditions. Ophthalmology . 2021;128(9):1356–1362. doi: 10.1016/j.ophtha.2021.03.008 Chen BY, Antaki F, Gonzalez M, et al. Automated Identification of Stroke Thrombolysis Contraindications from Synthetic Clinical Notes: A Proof-of-Concept Study. 
Cerebrovasc Dis Extra. 2025;15(1):130–136. doi: 10.1159/000545317 Matheny ME, Yang J, Smith JC, et al. Enhancing Postmarketing Surveillance of Medical Products With Large Language Models. JAMA Netw Open. 2024;7(8):e2428276. doi: 10.1001/jamanetworkopen.2024.28276 Balendran A, Benchoufi M, Evgeniou T, Ravaud P. Algorithmovigilance, lessons from pharmacovigilance. npj Digit Med. 2024;7(1):270. doi: 10.1038/s41746-024-01237-y Weck OLD. Multiobjective optimisation: history and promise. Published online 2004. Antaki F, Mikhail D, Milad D, et al. Performance of GPT-5 Frontier Models in Ophthalmology Question Answering. arXiv. Preprint posted online August 13, 2025:arXiv:2508.09956. doi: 10.48550/arXiv.2508.09956 Chen L, Zaharia M, Zou J. How is ChatGPT’s behavior changing over time? arXiv. Preprint posted online October 31, 2023:arXiv:2307.09009. doi: 10.48550/arXiv.2307.09009 Nature Machine Intelligence. What is in your LLM-based framework? Nat Mach Intell. 2024;6(8):845. doi: 10.1038/s42256-024-00896-6 Additional Declarations Competing interest reported. FA is an equity owner in SIMA Surgical Intelligence Inc. PAK is a cofounder of Cascader Ltd. and has acted as a consultant for Retina Consultants of America, Roche, Boehringer Ingelheim, and Bitfount, and is an equity owner in Big Picture Medical. He has received speaker fees from Zeiss, Thea, Apellis, and Roche. He has received travel support from Bayer and Roche. He has attended advisory boards for Topcon, Bayer, Boehringer Ingelheim, and Roche. None of the other authors report any conflicts of interest. 
Supplementary Files SupplementaryMaterials.docx
Figure legends

Figure 1. Figure depicting the study workflow. 600 clinical letters were randomly sampled and enriched to develop a test dataset with a minimum of 40 positive cases per class. This dataset was labelled by two ophthalmologists; discrepancies were discussed and used to refine labelling guidelines. Data extraction was performed using a large language model (LLM) - an iterative approach to prompt engineering was adopted, and prompt refinement was informed by error analyses. The best-performing prompt was tested in an external dataset to evaluate its generalisability to other letters. The prompting strategy was also tested for generalisability to other LLM backbones.

Figure 2. Figure depicting the iterative modular approach employed to refine the prompt, with prompt 1 featuring only basic task instructions. Prompt 6 (few-shot learning) did not improve performance and was therefore not kept in the next iteration (prompt 7).

Figure 3. Line plots featuring the trajectory of performance changes across the iterative prompt refinement cycle, in terms of micro-averaged F1 scores, with each model represented by a single coloured line. (A) shows the Gemini family of models only - Gemini 1.5 flash (“development” model), Gemini 1.5 pro, and Gemini 2.5 flash. (B) shows a line plot featuring all models included in this study across all prompts.

Figure 4. Pareto frontier analysis of LLM configurations across multiple LLM families and model sizes. This figure presents the optimal criteria used to identify the most resource-efficient model configurations, balancing the primary performance metric (micro-averaged F1 score) against two key operational constraints, cost and latency. The Pareto frontier represents the set of non-dominated models where no other configuration achieves better performance and lower cost/time, with (A) featuring the performance-cost trade-off (the cost frontier), and (B) the performance-time trade-off (the latency frontier). We present the Pareto frontier for all models (black line) and for local models only (red dotted line).

Figure 5. Bubble chart depicting the cost and time trade-offs across multiple LLM families and model size, with performance (F1) represented by the bubble size. The axes are plotted on a log-log scale to better show the orders of magnitude difference in resource consumption. More efficient models are represented by the largest bubbles (high F1 score) in relation to lower time and/or cost (nearer the graph origin at the bottom left corner).
None of the other authors report any conflicts of interest.","formattedTitle":"Developing a scalable pipeline for data extraction from clinical letters through resource-efficient prompt engineering","fulltext":[{"header":"Introduction","content":"\u003cp\u003eThe adoption of electronic health records (EHRs) has revolutionised healthcare delivery by enabling rapid access to patient information and improving documentation quality.\u003csup\u003e\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e,\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u003c/sup\u003e However, as EHRs were primarily designed with clinical, billing, and administrative needs in mind,\u003csup\u003e3,4\u003c/sup\u003e secondary use for research presents a misalignment due to data quality and availability.\u003csup\u003e\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e\u003c/sup\u003e In addition, an estimated 80% of EHR data exists in an unstructured format (e.g. as clinical letters, discharge summaries, progress notes, imaging reports),\u003csup\u003e6\u003c/sup\u003e and manual data extraction remains a significant bottleneck due to the resource requirements and risk of human errors.\u003c/p\u003e \u003cp\u003eA scalable, systematic, and automated approach to data extraction is necessary to maximise the utility of this valuable resource. 
While natural language processing (NLP) techniques have shown promise, rule-based systems are brittle and do not generalise well, while traditional supervised machine learning (including early transformers such as BERT) require large datasets with task-specific annotations for fine-tuning.\u003csup\u003e\u003cspan additionalcitationids=\"CR8\" citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u003c/sup\u003e Large language models (LLMs) are generative artificial intelligence (AI) models trained on vast corpora of text that can understand and generate natural human-like language.\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e\u003c/sup\u003e Compared to traditional NLP methods, LLMs excel in tasks requiring contextual understanding and reasoning, and can adapt to nuances and variability in complex clinical narratives without requiring elaborate feature engineering or a large manually annotated training dataset.\u003c/p\u003e \u003cp\u003eWhile fine-tuning LLMs can improve performance, alternative techniques such as prompt engineering may also minimise the need for labeled training data, computational resources, and custom training for each new use case.\u003csup\u003e\u003cspan citationid=\"CR10\" class=\"CitationRef\"\u003e10\u003c/span\u003e,\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e\u003c/sup\u003e Previous work has demonstrated the potential of this approach in extracting data from pathology reports,\u003csup\u003e12\u003c/sup\u003e radiology reports,\u003csup\u003e13\u003c/sup\u003e and hospital admission records in emergency and acute psychiatry settings.\u003csup\u003e\u003cspan citationid=\"CR14\" class=\"CitationRef\"\u003e14\u003c/span\u003e,\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e However, beyond proof-of-concepts, deploying 
a high-performing model at scale is not a trivial task. Real-world deployment demands a more balanced consideration of accuracy that goes beyond sensitivity and specificity, given the impact of any upstream errors on downstream research outputs. In addition, a singular focus on performance is insufficient - a comprehensive validation framework that takes resource use and trade-offs into account is needed.\u003c/p\u003e \u003cp\u003eOne of the richest sources of unstructured clinical data is the corpus of clinical records relating to outpatient care, but this is relatively underexplored. Ophthalmology is an ideal specialty within which to evaluate this question - as the busiest outpatient specialty in the United Kingdom (UK), with close to 10\u0026nbsp;million outpatient appointments in England as of 2024-5,\u003csup\u003e16\u003c/sup\u003e it represents a rich source of untapped data from a high-volume specialty. In particular, ophthalmic clinical letters, a primary clinical artefact that summarises the clinical encounter and functions both as a care record and communication with primary care services, are characterised by a high density of specialist terminology and non-standardised abbreviations,\u003csup\u003e17\u003c/sup\u003e and typically contain a complex mix of unstructured narratives, semi-structured data, uncertainties that may be highly nuanced, and features such as laterality which are essential to clinical meaning, all without necessarily adhering to a standardised template.\u003c/p\u003e \u003cp\u003eIn this study, we develop a scalable pipeline for the complex task of extracting information from real-world ophthalmic clinical letters. 
We evaluate the extent to which out-of-the box LLMs can achieve a pre-specified and clinically reliable performance threshold, compare performance across frontier and locally deployable LLMs, and propose a multi-dimensional assessment to evaluate performance as well as operational efficiency, failure modes, and robustness for deployment on real-world data pipelines for downstream tasks. Using this approach, we select the optimal model to scale up our data extraction pipeline to a large dataset.\u003c/p\u003e"},{"header":"Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eTask definition\u003c/h2\u003e \u003cp\u003eWe aimed to demonstrate the development of an end-to-end pipeline for ophthalmic data extraction to support downstream research using real-world data, such as large-scale clinical validation of AI systems in the real world. For this proof-of-concept, we selected a deep learning algorithm for diagnosing and triaging patients with suspected macular disease as an exemplar.\u003csup\u003e\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eWe aimed to extract data on nine common macular conditions that the algorithm identifies from optical coherence tomography (OCT) scans - choroidal neovascularisation (CNV), macular oedema, central serous chorioretinopathy (CSCR), drusen, geographic atrophy (GA), epiretinal membrane (ERM), partial thickness macular hole (PTMH), full-thickness macular hole (FTMH), and vitreomacular traction (VMT) (as defined in Supplementary Table\u0026nbsp;1). 
This model has been tested against expert clinicians in a small highly-selected test set of 997 cases (diagnostic accuracy study),\u003csup\u003e18\u003c/sup\u003e but has not been compared against real-world clinician performance and real-world disease prevalence at scale (agreement study).\u003c/p\u003e \u003cp\u003eAs scaling up real-world testing is limited by the poor quality and availability of structured diagnosis data in EHR records, the goal of this study was to develop a pipeline that could reliably identify the presence of each disease class and the corresponding laterality at the time of the appointment from unstructured clinic letters. The study workflow is detailed in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eData source\u003c/h3\u003e\n\u003cp\u003eThis study used clinical letters from a retrospective cohort of adult patients aged\u0026thinsp;\u0026ge;\u0026thinsp;18 years who attended the retina service (medical retina and vitreoretinal clinics) at Moorfields Eye Hospital NHS Foundation Trust (MEH) from 1 February 2017 to 14 November 2024. MEH encompasses 30 networked centres serving a socioeconomically and ethnically diverse catchment of 6\u0026nbsp;million people across London in the UK, approximately 9% of the UK population.\u003c/p\u003e \u003cp\u003eAll clinical letters used in this study were written in English and were anonymised prior to analysis. 
This process employed an automated de-identification NLP pipeline (AnonCAT, Cogstack).\u003csup\u003e\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e\u003c/sup\u003e Protected health information (PHI) such as names, date of birth, identifiers (hospital or NHS numbers), location, and contact details were redacted in accordance with guidance issued by the UK Information Commissioner\u0026rsquo;s Office.\u003csup\u003e\u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e\u003c/sup\u003e This study was conducted as a service improvement project (2024/1568_v1) and adhered to the tenets of the Declaration of Helsinki.\u003c/p\u003e\n\u003ch3\u003eDevelopment dataset\u003c/h3\u003e\n\u003cp\u003eA development dataset of 600 letters (from 600 patients, all written in 2022) was developed. The dataset was constructed using a two-stage sampling strategy to ensure adequate representation of both common and rarer conditions. Initially, a consecutive sample of 400 clinical letters was labelled to mirror the real-world data distribution and disease complexity. Due to class imbalance (prevalence ranging from 1% in GA to 18% in macular oedema in this sample), four low-prevalence classes (CSCR, GA, PTMH, FTMH) were enriched with additional cases. For this enrichment dataset, we employed a hybrid retrieval method which combined a traditional keyword-based retriever (BM25)\u003csup\u003e\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e\u003c/sup\u003e with a semantic retriever (MedCPT)\u003csup\u003e\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e\u003c/sup\u003e. 
Rankings were integrated using reciprocal rank fusion with a decaying weight across multiple query terms to balance lexical and semantic relevance.\u003csup\u003e\u003cspan citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u003c/sup\u003e To mitigate potential bias from oversampling straightforward cases, this enrichment was stratified and incorporated a range of probabilities for each condition. This approach ensured statistically robust sensitivity estimates while maintaining feasibility for manual labeling. A minimum threshold of 40 positive cases per class was set to capture a range of case complexity and writing styles.\u003c/p\u003e\n\u003ch3\u003eLabelling guidelines and clinician labelling\u003c/h3\u003e\n\u003cp\u003eThe primary extraction task required the identification of two key pieces of information for each of the nine disease classes:\u003c/p\u003e \u003cp\u003e1. Status: defined as \u0026lsquo;Absent\u0026rsquo;, \u0026lsquo;Present\u0026rsquo;, or \u0026lsquo;Suspected\u0026rsquo;. In order to be defined as \u0026lsquo;Present\u0026rsquo;, the condition had to be documented as current at the time of the clinic visit, explicitly excluding historical mentions if the condition had fully resolved without further sequelae. \u0026lsquo;Suspected\u0026rsquo; referred to cases where the clinician expressed diagnostic uncertainty.\u003c/p\u003e \u003cp\u003e2. Laterality: defined as the corresponding laterality (left, right, both) for each condition.\u003c/p\u003e \u003cp\u003eTo support the development of a gold standard benchmark, labelling guidelines were initially developed by an ophthalmologist with eight years of clinical experience. This dataset was then labelled independently by two ophthalmologists who have practiced clinically for eight and six years. 
All discrepancies were resolved through a consensus review process to determine the final label for each case, with recourse to a senior ophthalmologist for arbitration where necessary. Agreement was tabulated using Cohen\u0026rsquo;s kappa (κ). This process also served to support the iterative refinement of the labelling guidelines to ensure clarity and consistency.\u003c/p\u003e\n\u003ch3\u003ePrompt engineering, refinement, and error taxonomy development\u003c/h3\u003e\n\u003cp\u003eThe prompt was constructed using a modular approach to prompt engineering, which started with a simple baseline prompt, to which additional components and refinements were iteratively added with the aim of improving performance. The final standardised prompt integrated the following modules in a sequential chain (Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e2\u003c/span\u003e):\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e1. Task definition specifying the core instructions (\u0026ldquo;Role\u0026rdquo; and \u0026ldquo;Task\u0026rdquo;) and the required JSON output schema (\u0026ldquo;JSON Schema and Instructions\u0026rdquo;); and\u003c/p\u003e \u003cp\u003e2. Specifying a requirement to summarise and provide the evidence for each extracted finding (Evidence rationalisation); and\u003c/p\u003e \u003cp\u003e3. Core principles from the initial labelling guideline, which discuss principles relevant across all disease; and\u003c/p\u003e \u003cp\u003e4. Description of conditions comprising a detailed description of each condition from the initial labelling guideline; and\u003c/p\u003e \u003cp\u003e5. Refined instructions derived from the iterative error analysis which were aimed at mitigating common failure modes. Prompt 5b rearranged the position of different modules (i.e. moving the letter from the bottom of the prompt to the top, immediately after the role and task description) to test the effect on model performance;\u003c/p\u003e \u003cp\u003e6. 
In-context learning (ICL), which included providing a few examples of specific tasks (formatted as per the required output JSON schema) in which errors remained despite the provision of refined instructions; or\u003c/p\u003e \u003cp\u003e7. Chain-of-thought (CoT) thinking, which involved asking the model to \u0026ldquo;think\u0026rdquo; through each step of the task in sequential order.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003eThe prompt structure is outlined in Supplementary Fig.\u0026nbsp;1. An example of a letter and outputs is presented in Supplementary Fig.\u0026nbsp;2.\u003c/p\u003e \u003cp\u003eGemini 1.5 Flash (Google) was selected as the baseline development model for the iterative prompt engineering and refinement phase. This was because of its unique balance of speed, cost, and performance which helps facilitate rapid and iterative testing. All testing was conducted in August 2025 (temperature 0, maximum tokens 8092). Access to all Gemini models was provisioned through a secure virtual machine hosted in a dedicated controlled research environment in Google Cloud Platform.\u003c/p\u003e \u003cp\u003eIterative refinement was performed with the aim of optimising the model outputs, using a pre-specified performance target of F1 score \u003cspan class=\"InlineEquation\"\u003e\u003cspan class=\"mathinline\"\u003e\\(\\:\\ge\\:\\)\u003c/span\u003e\u003c/span\u003e 0.95. The metric was chosen as it provides a balanced assessment of a model's precision and recall (sensitivity), since both are essential in ensuring the output is of sufficient quality to serve as a reliable \u0026ldquo;silver standard\u0026rdquo; for subsequent research and clinical validation tasks. F1 score was selected as the target over more commonly preferred metrics such as sensitivity and specificity in consideration of the need for balancing precision and recall for downstream tasks. 
In large-scale data extraction pipelines, false positives can introduce noise, whereas false negatives risk systematic omission of relevant clinical information. The F1 score therefore provides a more holistic assessment of model utility for real-world deployment, where performance must remain robust under varying prevalence and documentation styles.\u003c/p\u003e \u003cp\u003eErrors - defined as false negatives (FN, missed entities) or false positives (FP, e.g. incorrect extractions or spurious predictions) - were identified from these initial outputs. Analysis of these errors informed successive refinements to prompt structure and content, as well the development of an error taxonomy to support qualitative examination of model failure modes. Taxonomy development followed Braun and Clarke\u0026rsquo;s approach to inductive thematic analysis,\u003csup\u003e24\u003c/sup\u003e which involved data familiarisation, generation of initial error categories based on observed patterns, and iterative refinement into key higher-level themes and sub-themes through repeated analytic engagement.\u003c/p\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003eTesting for generalisability and efficiency\u003c/h2\u003e \u003cp\u003eAdditional testing was performed to assess the real-world applicability of this prompting strategy. Note that prompt development and refinement were conducted exclusively using the aforementioned development dataset. 
Once finalised, the prompts were frozen and evaluated without further modification on all external datasets and model families.\u003c/p\u003e \u003cp\u003eFirstly, cross-model robustness was tested by evaluating the prompting strategy on other models from the Gemini family (\u0026lsquo;Gemini 1.5 Pro\u0026rsquo;, \u0026lsquo;Gemini 2.5 Flash\u0026rsquo;, and \u0026lsquo;Gemini 2.5 Flash Updated\u0026rsquo;, an updated release issued by Google in September 2025\u003csup\u003e25\u003c/sup\u003e), followed by testing on foundational LLM backbones from 7 different open-weight (\u0026ldquo;local\u0026rdquo;) model families (Gemma/Medgemma, Llama, Mistral, DeepSeek, Phi, Cogito, GPT-oss) and across a range of model sizes within the same family. To facilitate the evaluation of local models in the context of constrained computational resources, inference for all local models was executed via the \u003cem\u003eollama\u003c/em\u003e framework\u003csup\u003e\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e\u003c/sup\u003e using 4-bit post-training quantisation, primarily using the Q4_K_M and Q_4_0 GGUF variants. Prior work has demonstrated that 4-bit quantisation offers an optimal trade-off between memory footprint and model performance.\u003csup\u003e\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e\u003c/sup\u003e All experiments were performed on a secure virtual machine, which were equipped with 1\u0026ndash;4 NVIDIA Tesla T4 GPUs for experiments conducted with local models. 
The full list of models, parameters, and computational resources is presented in Supplementary Table\u0026nbsp;2.\u003c/p\u003e \u003cp\u003eSecondly, performance of the primary LLM family was evaluated on an independent dataset of 300 letters randomly sampled from the wider dataset of 219,930 letters spanning 2017\u0026ndash;2024 (excluding 2022), which comprised the primary subspecialties tested in the \u0026ldquo;development dataset\u0026rdquo; (medical retina, vitreoretinal surgery) and included additional unseen ophthalmic subspecialties in which macular OCT scans may be used. This tested robustness against potential data drift resulting from clinician turnover and/or evolving writing styles, changing disease prevalence or complexity, or other temporal factors.\u003c/p\u003e \u003cp\u003eA Pareto frontier analysis was conducted to identify the subset of LLMs that offer the optimal trade-off between performance (the micro-averaged F1 score, used to determine the overall value of each model) and real-world considerations for scalability (mean cost and time per letter). The Pareto frontier represents the set of non-dominated models, i.e. those for which no other model performs at least as well on every metric and strictly better on at least one. For the proprietary models (Gemini 1.5 Flash, 1.5 Pro, 2.5 Flash), cost was calculated using Gemini\u0026rsquo;s per-million-token pricing for input and output tokens. For local models, cost was calculated as a function of GPU infrastructural costs and time. We established the Pareto frontiers across all models and for local models alone.\u003c/p\u003e \u003c/div\u003e\n\u003ch3\u003eError analysis\u003c/h3\u003e\n\u003cp\u003eAn LLM-as-a-judge configuration was developed to facilitate error classification according to the error taxonomy. A representative sample of errors drawn from all models across 25 letters was coded by an experienced ophthalmologist. 
In parallel, an LLM (Gemini 2.5 Flash, selected for its high performance and optimal speed-cost configuration) was configured as an automated classifier (\u0026ldquo;LLM judge\u0026rdquo;) for the same error set, and was provided with identical contextual inputs, including the original clinical letter, the human reference standard and reasoning, and the erroneous LLM response and reasoning trace. Agreement between the human expert and LLM judge was measured using Cohen\u0026rsquo;s kappa (κ). The error classification prompt was refined to achieve a strong level of agreement (κ\u0026thinsp;\u0026gt;\u0026thinsp;0.80)\u003csup\u003e28\u003c/sup\u003e against the human reference standard, whereupon the automated workflow was deployed to classify the full corpus of errors to characterise failure modes across all models and prompts.\u003c/p\u003e \u003cdiv id=\"Sec10\" class=\"Section2\"\u003e \u003ch2\u003eStatistical analysis\u003c/h2\u003e \u003cp\u003eDescriptive statistics were used to characterise the dataset. Counts and proportions were summarised for categorical variables, and means and standard deviations (SD) or medians and interquartile ranges (IQR) were reported for continuous variables after checking for normality using visual inspection of Q-Q plots.\u003c/p\u003e \u003cp\u003eEach letter could contain zero, one, or multiple classes simultaneously. This was therefore treated as a multi-label classification problem, and performance metrics were calculated in a per-class, \u0026lsquo;one-vs-rest\u0026rsquo; manner, treating each class independently while allowing for multiple labels within the same clinical letter. Inter-rater agreement between both labellers was calculated using Cohen's κ on a per-class basis.\u003c/p\u003e \u003cp\u003ePerformance metrics were first computed on a per-class basis. These included sensitivity (recall), specificity, positive predictive value (PPV, or precision), negative predictive value (NPV), and F1 score. 
Micro- and macro-averaged scores were then computed for each metric to summarise performance across all classes. Macro-averaged scores weight all classes equally, ensuring that rarer disease classes are adequately represented, while micro-averaged scores pool counts across classes to provide a size-weighted measure of overall performance. Point estimates were accompanied by 95% confidence intervals (CIs) to quantify uncertainty, using the Wilson score method for proportion-based metrics or nonparametric bootstrap sampling (5000 iterations) as appropriate. All analyses were conducted in Python (v3.12.4). The \u003cem\u003estatsmodels\u003c/em\u003e and \u003cem\u003escikit-learn\u003c/em\u003e libraries were used to compute classification metrics and agreement, and to quantify uncertainty.\u003c/p\u003e \u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e \u003ch2\u003eDataset characteristics\u003c/h2\u003e \u003cp\u003eThe \u0026lsquo;development\u0026rsquo; dataset used for iterative prompt refinement comprised 600 clinic letters from 600 patients (median age 58 years [IQR 54\u0026ndash;77]; 55.5% female). There was strong inter-rater agreement between clinical experts using the initial labelling guidelines prior to arbitration and refinement (Cohen\u0026rsquo;s κ\u0026thinsp;=\u0026thinsp;0.834). The distribution of ground truth labels is shown in Supplementary Table\u0026nbsp;3.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003eIterative prompt refinement and performance evaluation\u003c/h2\u003e \u003cp\u003eFor diagnosis extraction, the iterative prompt engineering process, conducted on the development set using Gemini 1.5 Flash, demonstrated moderately high baseline performance on prompt 1 (task instructions only), with a micro-averaged F1 of 0.850 (95% CI 0.826\u0026ndash;0.874). 
Model performance improved incrementally across the iterative prompt refinement cycle, with the largest improvements seen following the integration of two key modules into the prompt: reasoning, which instructed the model to cite evidence from the text for each finding before classification (prompt 2), and refinement to the labelling guideline (prompt 5a-b), including revisions to the \u0026lsquo;core principles\u0026rsquo; and \u0026lsquo;description of conditions\u0026rsquo; sections based on the error analysis detailed in the next subsection. Conversely, incorporating more advanced prompt engineering techniques such as few-shot learning (prompt 6) and chain-of-thought prompting (prompt 7) resulted in a slight performance degradation relative to the clear instructions informed by error analysis that were provided in prompts 5a-b (Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e3\u003c/span\u003eA). Per-class analysis showed that performance for most conditions improved sharply from prompt 4 to 5, although there was one condition (ERM) that achieved excellent performance from baseline (Supplementary Fig.\u0026nbsp;3).\u003c/p\u003e \u003cp\u003eThe top-performing prompt (prompt 5b) achieved macro- and micro-averaged F1 scores of 0.960 (95% CI 0.936\u0026ndash;0.975) and 0.954 (95% CI 0.941\u0026ndash;0.967) respectively (sensitivity 0.99 [0.98\u0026ndash;1.00], specificity 0.99 [0.99\u0026ndash;0.99], PPV 0.92 [0.89\u0026ndash;0.94], NPV 1.00 [1.00\u0026ndash;1.00]). Seven out of nine classes reached a per-class F1 score exceeding the pre-specified threshold of 0.95. 
Sensitivities and specificities were high across the board and near-perfect in almost all cases (Supplementary Table\u0026nbsp;4).\u003c/p\u003e \u003cp\u003eThe same iterative refinement cycle was applied to two other models in the same family (1.5 Pro and 2.5 Flash), which showed a similar trajectory of performance improvement, suggesting that the principle of including detailed classification guidelines generalises across the same model family. Overall, Gemini 2.5 Flash achieved the strongest performance with a micro-averaged F1 of 0.98 (95% CI 0.97\u0026ndash;0.98), sensitivity 0.98 (0.96\u0026ndash;0.99), specificity 1.00 (1.00\u0026ndash;1.00), PPV 0.97 (0.95\u0026ndash;0.98), and NPV 1.00 (1.00\u0026ndash;1.00). All nine disease classes exceeded the pre-specified threshold.\u003c/p\u003e \u003cp\u003eLaterality extraction achieved strong performance across all prompts tested, with small variations between prompts that did not follow the trajectory for diagnosis extraction (Supplementary Table\u0026nbsp;5). Gemini 1.5 Flash achieved micro-F1 scores ranging from 0.960\u0026ndash;0.975, while Gemini 2.5 Flash achieved 0.979\u0026ndash;0.988 across prompts 1\u0026ndash;7.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003eGeneralisability across other models\u003c/h2\u003e \u003cp\u003ePerformance across seven additional LLM families and multiple model sizes (17 local LLMs in total) was evaluated to assess the interoperability of our approach. Overall, the iterative prompt refinement process demonstrated robustness across model families differing in scale and training strategy, although the magnitude of improvement varied between individual models and families. This robustness did not extend to certain models, which tended to be smaller (typically \u0026lt;\u0026thinsp;10B parameters) and for which performance declined as prompt length and complexity increased. 
Results for micro-F1 scores for diagnosis extraction are summarised in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e and Fig.\u0026nbsp;\u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e3\u003c/span\u003eB, and all other metrics are presented in Supplementary Table\u0026nbsp;6.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eSummary of micro-averaged F1 scores with 95% confidence intervals (CI) for diagnosis for all models, categorised by prompt.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colspan=\"8\" nameend=\"c9\" namest=\"c2\"\u003e \u003cp\u003eMicro-averaged F1 scores (95% CI)\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e 
\u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003ePrompt 1\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePrompt 2\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003ePrompt 3\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003ePrompt 4\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003ePrompt 5a\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003ePrompt 5b\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003ePrompt 6\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\"\u003e \u003cp\u003ePrompt 7\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ecogito:8b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.773 (0.74\u0026ndash;0.80)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.742 (0.71\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.759 (0.73\u0026ndash;0.79)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.816 (0.79\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e0.825 (0.80\u0026ndash;0.85)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.772 (0.74\u0026ndash;0.80)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.767 (0.74\u0026ndash;0.79)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.780 (0.75\u0026ndash;0.81)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ecogito:14b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.831 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.868 (0.84\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.865 (0.84\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.888 (0.87\u0026ndash;0.91)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.835 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.847 (0.82\u0026ndash;0.87)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.830 (0.81\u0026ndash;0.85)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.787 (0.76\u0026ndash;0.81)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ecogito:32b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.882 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.859 (0.83\u0026ndash;0.88)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.897 (0.88\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.891 
(0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.924 (0.91\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.930 (0.91\u0026ndash;0.94)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.927 (0.91\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.881 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003edeepseek-r1:14b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.876 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.888 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.877 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.897 (0.88\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.925 (0.91\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.922 (0.90\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.932 (0.92\u0026ndash;0.95)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e0.939 (0.92\u0026ndash;0.95)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003edeepseek-r1:32b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.881 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.883 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.886 (0.86\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.895 (0.88\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.934 (0.92\u0026ndash;0.95)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.942 (0.93\u0026ndash;0.96)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e0.946 (0.93\u0026ndash;0.96)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.924 (0.91\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egemini-1.5-flash\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.850 (0.83\u0026ndash;0.87)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.873 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.872 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.897 (0.88\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.920 (0.90\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.954 (0.94\u0026ndash;0.97)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.949 (0.94\u0026ndash;0.96)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.918 (0.90\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egemini-1.5-pro\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.872 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.876 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.865 (0.84\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.893 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.947 (0.93\u0026ndash;0.96)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.952 (0.94\u0026ndash;0.96)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.950 (0.94\u0026ndash;0.96)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.941 (0.93\u0026ndash;0.95)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egemini-2.5-flash\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.891 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.890 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.882 (0.86\u0026ndash;0.90)\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.911 (0.89\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.969 (0.96\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.975 (0.96\u0026ndash;0.98)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.971 (0.96\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e0.975 (0.96\u0026ndash;0.98)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egemini-2.5-flash-updated (October 2025 update)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.885 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.888 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.874 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.908 (0.89\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.966 (0.95\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.978 (0.97\u0026ndash;0.99)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.971 (0.96\u0026ndash;0.98)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.962 (0.95\u0026ndash;0.97)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd 
align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egemma2:27b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.840 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.844 (0.82\u0026ndash;0.87)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.832 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.906 (0.89\u0026ndash;0.93)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.869 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.840 (0.82\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.869 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.882 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egemma2:2b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.606 (0.57\u0026ndash;0.64)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.772 (0.74\u0026ndash;0.80)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e\u003cb\u003e0.781 (0.75\u0026ndash;0.81)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.762 (0.73\u0026ndash;0.79)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.564 (0.53\u0026ndash;0.60)\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.641 (0.61\u0026ndash;0.67)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.564 (0.53\u0026ndash;0.60)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.679 (0.65\u0026ndash;0.71)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egemma2:9b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.833 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.861 (0.84\u0026ndash;0.88)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.835 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.871 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.860 (0.84\u0026ndash;0.88)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.875 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.831 (0.81\u0026ndash;0.85)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e0.904 (0.88\u0026ndash;0.92)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003egpt-oss:20b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.902 (0.88\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.906 (0.89\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.907 (0.89\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.924 (0.91\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.945 (0.93\u0026ndash;0.96)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.951 (0.94\u0026ndash;0.96)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.938 (0.92\u0026ndash;0.95)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e0.958 (0.94\u0026ndash;0.97)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ellama3.1:8b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.742 (0.71\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.697 (0.67\u0026ndash;0.73)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.700 (0.67\u0026ndash;0.73)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.759 (0.73\u0026ndash;0.79)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.696 (0.67\u0026ndash;0.72)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.480 (0.45\u0026ndash;0.51)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.467 (0.44\u0026ndash;0.49)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.495 (0.47\u0026ndash;0.52)\u003c/p\u003e 
\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ellama3.3:70b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.884 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.886 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.874 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.907 (0.89\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e0.912 (0.89\u0026ndash;0.93)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.906 (0.89\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.907 (0.89\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.877 (0.86\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003emedgemma:27b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.870 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.869 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.864 (0.84\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.898 (0.88\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.913 
(0.90\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.911 (0.89\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.886 (0.86\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e\u003cb\u003e0.933 (0.92\u0026ndash;0.95)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003emistral-small3.2:24b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.877 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.874 (0.85\u0026ndash;0.90)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.858 (0.83\u0026ndash;0.88)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.891 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.926 (0.91\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e\u003cb\u003e0.937 (0.92\u0026ndash;0.95)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.932 (0.92\u0026ndash;0.95)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.921 (0.90\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003emistral-small:22b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.805 (0.78\u0026ndash;0.83)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.821 (0.80\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.846 (0.82\u0026ndash;0.87)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.899 (0.88\u0026ndash;0.92)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.890 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.811 (0.79\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.817 (0.79\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.811 (0.79\u0026ndash;0.83)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003emixtral:8x7b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.830 (0.81\u0026ndash;0.85)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.815 (0.79\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.837 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.894 (0.88\u0026ndash;0.91)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.886 (0.86\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.838 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.839 (0.81\u0026ndash;0.86)\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.869 (0.85\u0026ndash;0.89)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ephi3:14b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.727 (0.70\u0026ndash;0.76)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.767 (0.74\u0026ndash;0.79)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.816 (0.79\u0026ndash;0.84)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e\u003cb\u003e0.819 (0.79\u0026ndash;0.84)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.742 (0.71\u0026ndash;0.77)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.688 (0.66\u0026ndash;0.72)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e0.673 (0.64\u0026ndash;0.70)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.638 (0.61\u0026ndash;0.67)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003ephi4:14b\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0.890 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e0.901 (0.88\u0026ndash;0.92)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e0.890 (0.87\u0026ndash;0.91)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e0.907 (0.89\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e0.914 (0.90\u0026ndash;0.93)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e0.922 (0.90\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e0.932 (0.92\u0026ndash;0.95)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c9\"\u003e \u003cp\u003e0.923 (0.91\u0026ndash;0.94)\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThe top-performing local models were GPT-oss 20b (micro-F1 0.96 [95% CI 0.94\u0026ndash;0.97]; sensitivity 0.97 [0.95\u0026ndash;0.98]; specificity 0.99 [0.99\u0026ndash;1.00]; PPV 0.95 [0.92\u0026ndash;0.96]; NPV 1.00 [1.00\u0026ndash;1.00]) and Phi-4 14b (micro-F1 0.93 [0.92\u0026ndash;0.95]; sensitivity 0.97 [0.95\u0026ndash;0.98]; specificity 0.99 [0.99\u0026ndash;0.99]; PPV 0.90 [0.87\u0026ndash;0.92]; NPV 1.00 [1.00\u0026ndash;1.00]).\u003c/p\u003e \u003cp\u003eOverall, baseline performance in prompt 1 was moderately high across models (micro-F1\u0026thinsp;\u0026gt;\u0026thinsp;0.8 for models \u0026gt;\u0026thinsp;10b in size), and tended to improve across prompts 1\u0026ndash;4. Performance also tended to improve from prompt 4 to prompt 5 (introduction of refined explanations following error analysis for the development model) in the majority of models, whereas smaller models from specific LLM families (Gemma-2 2b, Llama-3.1 8b, Mistral Small 22b, Mixtral 8x7b) saw a sharp performance drop.\u003c/p\u003e \u003cp\u003eTwo advanced prompt engineering techniques (ICL and CoT) were introduced in prompts 6 and 7 respectively, building on the error analysis from prompts 5a-b. 
For a subset of models, there was a small performance improvement with ICL (Deepseek-R1 32b, Phi-4 14b) and CoT (Deepseek-R1 14b, Gemma-2 9b, GPT-oss 20b, Medgemma 27b) over the error-analysis-informed prompts 5a-b.\u003c/p\u003e \u003cp\u003ePrompts 5a and 5b differ only in the position of the clinical letter relative to the remainder of the prompt. Even so, this change resulted in performance differences across all models. There was a meaningful performance improvement in the Gemini family models from prompt 5a to prompt 5b (e.g. micro-F1 rose from 0.920 to 0.954 in Gemini 1.5 Flash, and from 0.969 to 0.975 in Gemini 2.5 Flash). This was also true for GPT-oss and the Gemma models, while the converse was true for the Llama models. Results were mixed for the Cogito, Mistral, Deepseek and Phi families.\u003c/p\u003e \u003cp\u003eLaterality extraction achieved strong performance across the additional LLM families. Exceptions included certain smaller models (Cogito 8b, Gemma-2 2b, Phi-3 14b), which achieved lower micro-F1 scores (\u0026lt;\u0026thinsp;0.90) across the board and which saw a small but perceptible performance decline from prompt 5a to prompt 7 (Supplementary Table\u0026nbsp;5).\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003eAnalysis of performance-cost trade-offs\u003c/h2\u003e \u003cp\u003eFigure \u003cspan refid=\"Fig5\" class=\"InternalRef\"\u003e4\u003c/span\u003eA-B shows the performance-cost and performance-time trade-offs (the \u0026ldquo;cost frontier\u0026rdquo; and \u0026ldquo;latency frontier\u0026rdquo;) across multiple LLM families and model sizes on Prompt 5b (the highest-performing prompt). The Pareto frontier was dominated by the Gemini family on both counts. When limited to local open-weight models only, the Pareto frontier was formed by gemma2-2b on the lower end and gpt-oss-20b on the higher end. 
Gemini 1.5 Flash was the most cost- and time-efficient model overall (Fig.\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e5\u003c/span\u003e).\u003c/p\u003e \u003cp\u003eResults across all prompts are displayed in Supplementary Figs.\u0026nbsp;5\u0026ndash;7. Overall, when analysed across all eight prompts, the Pareto frontier for all models was formed by the Gemini models on the lower end and two local models (phi4-14b and gpt-oss-20b) on the higher end from prompts 1\u0026ndash;4. Performance improvement diverged from prompt 5a onwards, with larger improvements seen in the Gemini models and smaller improvements in the two local models, meaning that these local models no longer formed part of the overall Pareto frontier.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003eError taxonomy\u003c/h2\u003e \u003cp\u003eThe error taxonomy and representative examples for each category are presented in Table\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e. 
Two key classes emerged from the analysis, which are described in brief below:\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003e\u003cb\u003eTaxonomy of errors, categorised as model-centric and data-centric errors.\u003c/b\u003e The table outlines the classification framework used for the analysis, detailing the eight identified error categories with definitions and illustrative examples from the clinical letters.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"3\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eExamples\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"4\" rowspan=\"5\"\u003e \u003cp\u003eModel-centric errors\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme 1: Deficiencies in domain knowledge - specifically ophthalmic knowledge such as subtypes, anatomy\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eE.g. 
the classification task for \u0026lsquo;CNV\u0026rsquo; required an understanding that conditions such as idiopathic polypoidal choroidal vasculopathy and retinal angiomatous proliferation are subtypes of CNV.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme 2: Errors in clinical inference (following correct identification of diagnoses or findings)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eE.g. Correctly identifying the presence of intraretinal and/or subretinal fluid in the letter, but inferring this to represent \u0026lsquo;macular oedema\u0026rsquo; (swelling of the macula due to fluid build-up). In reality, macular oedema is a specific description for leakage-driven fluid build-up, whereas subretinal fluid is not specific to macular oedema, and can be seen in the context of multiple diseases including CSCR\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme 3: Linguistic errors, specifically coordination and reference errors, can occur when a statement lists multiple items, to which the model incorrectly links a modifier due to a failure to correctly parse complex grammatical structures.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eE.g. parsing the sentence \u0026ldquo;He was referred for a macular hole in his right eye and an epiretinal membrane in his left eye. The former was not evident and the latter was mild\u0026rdquo;, requires a multi-step logical inference that extends beyond simple pattern matching.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme 4: Misinterpretation of context.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eE.g. the model fails to recognise that a referral reason (e.g. 
\u0026ldquo;Mr X was referred by his opticians for a macular hole\u0026rdquo;) does not constitute a current and definitive diagnosis, particularly as subsequent findings from a clinical examination, objective test, or the final impression may not support this.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme 5: Hallucinations, where the model fabricates information (such as diagnoses or findings) that is not present in the text\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eData-centric errors\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme 6: Conflicting information was occasionally present in the source text.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eTranscription errors, e.g. where the problem list (which may be automatically copied over from a previous clinical encounter) might state \u0026ldquo;mild non-proliferative diabetic retinopathy with macular oedema\u0026rdquo;, whereas the description of a test such as an optical coherence tomography (OCT) scan later in the same letter might report \u0026ldquo;macula dry, exudates only\u0026rdquo;, which would rule out the presence of macular oedema and should carry a higher epistemic weight due to its objective nature.\u003c/p\u003e \u003cp\u003eTemporality, e.g. where the problem list might state \u0026lsquo;full-thickness macular hole\u0026rsquo; and be followed by a further surgical procedure (e.g. \u0026lsquo;PPV/ ILM peel/ gas\u0026rsquo;) to treat the condition, and would typically imply surgical success unless specified otherwise. 
This would therefore mean that the condition was no longer present at the time of the clinical visit.\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTheme 7: Clinical uncertainty and ambiguity or hedging in the source text\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003eSome clinicians used phrases such as \u0026ldquo;suspicious for,\u0026rdquo; \u0026ldquo;features suggestive of,\u0026rdquo; \u0026ldquo;possible early\u0026hellip;\u0026rdquo;, or \u0026ldquo;query [condition]\u0026rdquo; to propose a possible diagnosis which would require further monitoring or investigation, while clearly favouring a specific diagnosis in the text. In addition, a clinical letter might list several potential differential diagnoses to explain the findings (e.g. \u0026ldquo;subretinal fluid could represent either CSCR or early CNV\u0026rdquo;).\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eFirstly, predominantly model-centric errors, which reflect deficiencies in the LLM\u0026rsquo;s domain knowledge (Theme 1, e.g. not knowing that idiopathic polypoidal choroidal vasculopathy is a subtype of CNV); errors in reasoning for clinical inference (Theme 2, e.g. incorrectly inferring that subretinal fluid meant macular oedema); failure to parse linguistic structures (Theme 3, e.g. incorrectly linking modifiers in coordinated sentences); misinterpretation of context (Theme 4, e.g. mistaking a referral reason for a definitive diagnosis); or \u0026ldquo;hallucinations\u0026rdquo;, where models confabulate or fabricate non-existent information such as diagnoses and findings (Theme 5).\u003c/p\u003e \u003cp\u003eSecondly, data-centric errors, which arise from the inherent complexities and ambiguities of the clinical letters themselves. 
These include conflicting information arising from competing diagnoses (most often due to transcription errors) or temporal differences where a condition had been successfully treated but both the original diagnosis and the treatment were included (Theme 6), as well as clinical ambiguity or uncertainty, such as the use of hedging phrases like \u0026ldquo;suspicious for\u0026rdquo; or listing multiple differential diagnoses while clearly favouring a specific diagnosis in the text (Theme 7).\u003c/p\u003e \u003cp\u003eInformed by these key themes, the prompt was refined to mitigate errors identified in the error analysis (Prompt 5a-b). A detailed description of each condition and its clinical meaning was provided to address Themes 1 and 2. To address Theme 6, an evidence hierarchy was added to the core principles, prioritising objective investigations such as OCT scans, followed by examination findings, then by the problem list. Instructions to disregard referral reasons were included, given that discrepancies between referral reasons and the final specialist diagnosis were not uncommon. The instructions on status were revised for clarity to address the errors in Theme 7.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003eError analysis\u003c/h2\u003e \u003cp\u003eA random sample of 25 letters (representing 512 false positives and false negatives across all models and prompts) was drawn. These errors were classified independently by an ophthalmologist (AYO) and an \u0026ldquo;LLM judge\u0026rdquo; according to the error taxonomy. Substantial agreement was achieved between the human and LLM judge (Cohen\u0026rsquo;s κ 0.909) (Supplementary Fig.\u0026nbsp;8). 
The optimised prompt was then applied to the complete dataset of model-generated errors to quantify their distribution and observe patterns and trends across models and prompts.\u003c/p\u003e \u003cp\u003eThe distribution of error themes by model and prompt is presented in Supplementary Figs.\u0026nbsp;9\u0026ndash;10. The most common errors were errors in domain knowledge (Theme 1) and clinical inference (Theme 2), as well as in handling conflicting information within the clinical text (Theme 6). These errors were mostly addressed with more directed prompting but could not be completely eliminated. While prompts 5a-5b were designed to address the most common errors in the development model (Gemini 1.5 Flash), the benefits of this prompting strategy extended across other models and model families. However, this pattern did not extend to smaller models (\u0026lt;\u0026thinsp;10b parameters), in which error rates (particularly hallucinations) increased with the length and depth of detail provided in the prompt. Hallucinations were far rarer in large local and proprietary models.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec18\" class=\"Section2\"\u003e \u003ch2\u003eGeneralisability in external validation\u003c/h2\u003e \u003cp\u003eTo validate the final pipeline, the top-performing prompt (prompt 5b) was applied to an unseen dataset of 300 letters. The pipeline demonstrated strong generalisability for diagnosis detection across all 9 classes in the same LLM family: Gemini 1.5 Flash (micro-averaged F1 0.945, 95% CI 0.920\u0026ndash;0.966), Gemini 1.5 Pro (0.960, 95% CI 0.939\u0026ndash;0.978), and Gemini 2.5 Flash (0.980, 95% CI 0.964\u0026ndash;0.993). 
The same was true for laterality: Gemini 1.5 Flash (0.941, 95% CI 0.904\u0026ndash;0.973), Gemini 1.5 Pro (0.912, 95% CI 0.870\u0026ndash;0.948), and Gemini 2.5 Flash (0.974, 95% CI 0.948\u0026ndash;0.995).\u003c/p\u003e \u003cp\u003eGemini 2.5 Flash was selected as the optimal configuration for scaling up the pipeline to the full dataset because of the optimal balance of performance, cost, and efficiency identified from the above experiments: micro-F1 0.975 (95% CI 0.965\u0026ndash;0.984); USD 0.00199/ letter; 7.8s (IQR 6.81\u0026ndash;9.20) per letter.\u003c/p\u003e \u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eIn this study, we developed a structured framework for clinical information extraction from real-world ophthalmic letters, which combines iterative prompt refinement, rigorous error analysis, and operational benchmarking with the aim of providing generalisable insights. We demonstrate that LLMs can achieve a pre-specified high-performance threshold for extracting ophthalmic diagnoses and laterality from unstructured clinical letters without the need for elaborate feature engineering. Using this approach, Gemini models achieved our pre-specified target of micro-average F1\u0026thinsp;\u0026ge;\u0026thinsp;0.95 from Prompt 5b onwards, and a local open-weight model (gpt-oss-20b) reached comparable performance from Prompt 5a onwards. Our work was grounded in a pragmatic examination of the trade-off between model performance and costs to develop a scalable and resource-efficient pipeline. 
Further contributions include an error taxonomy for data extraction from clinical letters.\u003c/p\u003e \u003cdiv id=\"Sec20\" class=\"Section2\"\u003e \u003ch2\u003ePrompt engineering techniques and iterative prompt refinement\u003c/h2\u003e \u003cp\u003eInformation extraction is a classic NLP task which has traditionally relied on rule-based systems or fine-tuned encoder models such as BERT and its derivatives.\u003csup\u003e\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e\u003c/sup\u003e A recent scoping review of LLM-based approaches to information extraction from radiology reports found that the majority (28/34, 82%) used BERT-based models,\u003csup\u003e30\u003c/sup\u003e although there is increasing interest in employing generative LLMs for clinical information extraction. While some studies report that LLM performance improves with fine-tuning,\u003csup\u003e31,32\u003c/sup\u003e recent work suggests that prompt engineering alone may suffice in this regard.\u003csup\u003e\u003cspan additionalcitationids=\"CR13 CR14\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR15\" class=\"CitationRef\"\u003e15\u003c/span\u003e\u003c/sup\u003e However, testing on a realistic spectrum of real-world data is essential, as reliance on small datasets or synthetic data limits clinical applicability.\u003csup\u003e33\u003c/sup\u003e\u003c/p\u003e \u003cp\u003eA key insight from our work is that familiarity with the behaviour of the model or model family is important. While the methodology applied the same broad prompt refinement strategy across different model architectures, different models possessed distinct sensitivities, such as tolerance to prompt length, susceptibility to \u0026ldquo;context rot\u0026rdquo;, and response to text positioning. 
For example, we showed that performance in models from the Gemini family improved when the clinical letter was introduced immediately after the role and task in the prompt, whereas the opposite was true for the Llama and Mistral models. This is analogous to previous work exploring model context length, which found that specific LLMs performed better at extracting data from the very beginning (primacy bias) and very end (recency bias) of their input prompt context windows,\u003csup\u003e34\u003c/sup\u003e and extends work done on short single-sentence prompts, which found that prompt position matters.\u003csup\u003e\u003cspan citationid=\"CR35\" class=\"CitationRef\"\u003e35\u003c/span\u003e\u003c/sup\u003e Awareness of these distinctions is essential when designing and optimising prompts for specific models and tasks.\u003c/p\u003e \u003cp\u003eIn addition, advanced prompt engineering techniques did not always improve model performance. Some models (e.g. Deepseek-R1 32b, GPT-oss 20b, Phi-4 14b) showed improvement with few-shot learning, though others (e.g. Gemini 1.5 Flash, 1.5 Pro, 2.5 Flash) demonstrated slight performance deterioration. This heterogeneity reinforces the importance of empirical prompt testing, even within the same model family, rather than assuming that more elaborate prompting yields better results.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec21\" class=\"Section2\"\u003e \u003ch2\u003eThe nature of real-world clinical texts and the need for error analysis\u003c/h2\u003e \u003cp\u003eOverall, our findings support the use of iterative prompt refinement as a systematic and reproducible approach for improving model reliability and interpretability. Beyond improving model performance, this can also serve as a diagnostic tool for understanding model behaviour and generalisation. 
However, as prompts interact closely with the data they are applied to, model performance ultimately depends on the irregularities, ambiguities, and inconsistencies inherent in real-world clinical text.\u003c/p\u003e \u003cp\u003eError analysis showed that model performance was influenced by both inherent model limitations (\u0026ldquo;model-centric errors\u0026rdquo;) and the complexity of the data itself (\u0026ldquo;data-centric errors\u0026rdquo;). The recurring failure modes identified included gaps in domain knowledge, errors in clinical inference, contextual misattribution, and lexical ambiguity. These findings suggest that LLMs are constrained not only by their training data but also by the inherent irregularity and contextual richness of real-world clinical documentation itself. Such documents often contain idiosyncratic shorthand, inconsistent formatting and abbreviations, and implicit reasoning that challenge deterministic parsing,\u003csup\u003e36\u003c/sup\u003e highlighting the need for model testing on a range of real-world data reflective of clinical realities beyond highly-curated benchmark datasets\u003csup\u003e\u003cspan citationid=\"CR37\" class=\"CitationRef\"\u003e37\u003c/span\u003e\u003c/sup\u003e and reinforcing the importance of systematic error analysis.\u003c/p\u003e \u003cp\u003eFor example, practising clinicians will recognise that discrepancies frequently exist between the referrer\u0026rsquo;s diagnosis and the specialist\u0026rsquo;s final diagnosis.\u003csup\u003e\u003cspan citationid=\"CR38\" class=\"CitationRef\"\u003e38\u003c/span\u003e,\u003cspan citationid=\"CR39\" class=\"CitationRef\"\u003e39\u003c/span\u003e\u003c/sup\u003e A recent ophthalmology audit found that the referral reasons for OCT abnormalities provided by primary eye care professionals differed from the final diagnosis made by a consultant ophthalmologist in 61.2% of cases.\u003csup\u003e\u003cspan citationid=\"CR38\" 
class=\"CitationRef\"\u003e38\u003c/span\u003e\u003c/sup\u003e In addition, conflicting information can frequently arise from transcription or documentation errors in busy high-throughput clinical environments. This is often less consequential for clinicians, who can resolve these conflicts through contextual knowledge and understanding, but poses a significant challenge for LLMs. Our systematic error analysis subsequently allowed us to articulate an 'evidence hierarchy' that mirrors the clinician's thinking process for conflict resolution.\u003c/p\u003e \u003cp\u003eIn this study, quantifying and characterising these errors through a structured error taxonomy facilitated targeted remediation via iterative prompt refinement. Recognising and systematising these error patterns moves evaluation beyond global performance metrics toward interpretability, that is, understanding why errors occur and where they matter most. This approach supports safer deployment by focusing attention on high-risk failure modes and prioritising quality improvement at the interface between language, context, and clinical meaning. However, a small proportion of errors cannot be addressed through prompt refinement alone. 
Addressing epistemic uncertainty from the model\u0026rsquo;s intrinsic knowledge gaps may in some cases require fine-tuning or domain adaptation, although this must be weighed against the operational simplicity and generalisability of prompt-based optimisation.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec22\" class=\"Section2\"\u003e \u003ch2\u003eImplications for real-world deployment\u003c/h2\u003e \u003cp\u003eLLMs\u0026rsquo; information extraction capabilities can be leveraged to detect adverse events and medication-related outcomes from EHRs, which holds significant potential for enhancing downstream pharmacovigilance and pharmaco-epidemiology tasks to support postmarket surveillance of medical products.\u003csup\u003e\u003cspan citationid=\"CR41\" class=\"CitationRef\"\u003e41\u003c/span\u003e\u003c/sup\u003e In this study, we extend this proposal to show how LLMs could be used to support registry curation, large-scale audit, or algorithmovigilance\u003csup\u003e\u003cspan citationid=\"CR42\" class=\"CitationRef\"\u003e42\u003c/span\u003e\u003c/sup\u003e (monitoring of AI embedded in healthcare systems for clinical care post-deployment), which could take the form of institutional dashboards overseen by AI safety committees to ensure oversight aligns with regulatory obligations and clinical accountability.\u003c/p\u003e \u003cp\u003eIn our experience, several practical challenges need to be addressed in order to fully realise this potential. While high performance is a fundamental requirement, the operational costs and time required to achieve this are perhaps equally important for large-scale deployment. The Pareto frontier is a concept adopted from economics,\u003csup\u003e43\u003c/sup\u003e which takes into account the trade-offs between two objectives (e.g. cost and accuracy) to identify the set of optimal solutions for a multi-objective optimisation problem. 
It provides a pragmatic framework for examining these trade-offs, identifying models that deliver near-optimal performance relative to their computational and financial burden. Application to medical AI tasks has been limited, although a recent study comparing LLM performance in medical question answering has employed this to identify Pareto efficient configurations.\u003csup\u003e\u003cspan citationid=\"CR44\" class=\"CitationRef\"\u003e44\u003c/span\u003e\u003c/sup\u003e In this study, we extend its application to the more granular and complex task of multi-class clinical data extraction from real-world clinical letters, in addition to attempting to synthesise the trade-offs between three factors that are critical to LLM deployment on real-world clinical data (performance vs cost and latency), rather than the standard two. Balancing these factors is a strategic decision shaped by institutional priorities, financial resources, and infrastructural constraints. The Pareto frontier analysis was helpful here in determining the optimal model to take forward for deployment at scale, as the capacity to process thousands of records per hour may sometimes be more valuable than marginal improvements in accuracy, for example.\u003c/p\u003e \u003cp\u003eFurther practical considerations include institutional policies, which often determine whether and which proprietary models can be used, while computational infrastructure may limit the feasibility of large locally-deployed LLMs, favouring smaller alternatives which may have performance limitations depending on the specific task. Furthermore, reliance on proprietary frontier LLMs introduces risk of performance or algorithmic drift due to unannounced model updates,\u003csup\u003e45,46\u003c/sup\u003e which led us to test the performance and the interoperability of our approach on both frontier and local models available to us. 
In addition, data drift over time can arise from evolving documentation or coding conventions, which can degrade performance. Overall, these factors underscore the need for continuous monitoring and error analysis within a structured surveillance framework to support safe deployment at scale.\u003c/p\u003e \u003cdiv id=\"Sec23\" class=\"Section3\"\u003e \u003ch2\u003eFuture work\u003c/h2\u003e \u003cp\u003eFuture work will focus on operationalising this methodology into a clinician-friendly code-free software platform that can be seamlessly integrated into real-world clinical and research workflows. Although our results demonstrated strong performance across multiple models and model families, the probabilistic nature of LLMs may necessitate additional safeguards for large-scale deployment. This may include developing uncertainty metrics with human-in-the-loop review of cases flagged as low confidence. For fully automated applications, scalable audit mechanisms such as random or stratified sampling over time or subgroup analyses will be essential for monitoring the data extraction pipeline prospectively to detect drift and maintain reliability. Finally, extending beyond classification toward richer clinical concept extraction and normalisation (e.g. mapping clinician-described conditions to standard ontologies) could broaden downstream utility while preserving interpretability. Balancing this increased semantic depth with the complexity of evaluation as well as computational efficiency for operationalisation will support the next phase of real-world implementation.\u003c/p\u003e \u003c/div\u003e \u003c/div\u003e \u003cdiv id=\"Sec24\" class=\"Section2\"\u003e \u003ch2\u003eStrengths and limitations\u003c/h2\u003e \u003cp\u003eStrengths of this study include the rigorous and clinically-grounded design led by domain experts familiar with the nuances of ophthalmic documentation and the operational realities of clinical workflows. 
Dataset creation was deliberately structured to balance real-world sampling with enrichment to minimise the impact of class imbalance, supported by gold-standard labelling and external validation to confirm generalisability. Secondly, our evaluation framework was designed to identify the model of greatest utility, moving beyond standard accuracy metrics to assess operational efficiency by employing a Pareto analysis to formally identify the models offering optimal value. Finally, we have conducted a qualitative error analysis and developed an error taxonomy for ophthalmic clinical letters, which facilitated the identification of model failure modes and the evaluation of how inherent ambiguities in clinical documentation drive these failures. These insights are transferable to other clinical data extraction domains.\u003c/p\u003e \u003cp\u003eIn terms of limitations, we focused on a single language (English) as this is the predominant language in the UK. Despite external validation across subspecialties and time, our findings are based on clinical letters from a single institution. However, as Moorfields is the largest eye hospital in the UK, comprising 27 networked sites, the dataset captures a significant diversity of patient populations and clinicians, including writing styles. This inherent heterogeneity provides a degree of robustness and makes our findings more representative than typical single-site studies, although generalisability to other countries and healthcare systems should be tested as well. Nevertheless, the conceptual framework used to develop the pipeline should be broadly applicable. In addition, while our use of a standardised prompt was necessary for a fair comparison, this may not reflect the peak performance that each model could achieve with a tailored prompt. 
However, it provides a more pragmatic measure of a model's \u0026ldquo;out-of-the-box\u0026rdquo; usability and the portability of our prompt engineering, a critical factor for real-world deployment given the infeasibility of designing and maintaining numerous bespoke prompts. Finally, given the rapid pace of development in LLMs, our performance ratings should be considered a snapshot in time, although we believe the framework itself has enduring value.\u003c/p\u003e \u003c/div\u003e"},{"header":"Conclusion","content":"\u003cp\u003eWe have demonstrated that a structured and iterative approach to prompt refinement can be used to efficiently leverage LLMs for real-world clinical information extraction at scale, thereby transforming unstructured text into structured, high-fidelity data for downstream tasks. By employing systematic error characterisation and Pareto frontier analyses for cost and latency, we reframe the evaluation of LLMs from a narrow focus on performance to a broader operational perspective that encompasses the considerations determining whether these systems can be scaled safely and in a resource-efficient manner. 
This framework may help lay the foundation for next-generation data pipelines that can accelerate scientific discovery and power continuous learning health systems.\u003c/p\u003e"},{"header":"Abbreviations","content":"\u003cp\u003eAI, artificial intelligence\u003c/p\u003e\n\u003cp\u003eCSCR, central serous chorioretinopathy\u003c/p\u003e\n\u003cp\u003eCNV, choroidal neovascularisation\u003c/p\u003e\n\u003cp\u003eEHR, electronic health records\u003c/p\u003e\n\u003cp\u003eERM, epiretinal membrane\u003c/p\u003e\n\u003cp\u003eFTMH, full-thickness macular hole\u003c/p\u003e\n\u003cp\u003eGA, geographic atrophy\u003c/p\u003e\n\u003cp\u003eLLM, large language model\u003c/p\u003e\n\u003cp\u003eNLP, natural language processing\u003c/p\u003e\n\u003cp\u003eOCT, optical coherence tomography\u003c/p\u003e\n\u003cp\u003eVMT, vitreomacular traction\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003eThis study was approved by the Moorfields Audit committee.\u003c/p\u003e\u003ch3\u003eAuthor contributions\u003c/h3\u003e\n\u003cp\u003eAYO conceptualised and coordinated the study, and performed data acquisition (together with IB), analysis and interpretation. AYO and QNN developed and executed the programming code, with QNN providing technical direction and implementation support. AYO prepared the first draft of the manuscript, which was critically reviewed and revised by all authors (AYO, QN, IB, JE, FA, MS, DAM, LJ, ED, YZ, GM, YT, AKD, PAK), who have read and approved the final manuscript.\u003c/p\u003e\n\u003ch3\u003eAcknowledgements\u003c/h3\u003e\n\u003cp\u003eAYO is supported by a National Institute for Health Research (NIHR) - Moorfields Eye Charity (MEC) Doctoral Fellowship (NIHR303691). PAK is supported by a UK Research \u0026amp; Innovation Future Leaders Fellowship (MR/T019050/1), Moorfields Eye Charity with The Rubin Foundation Charitable Trust (GR001753), and an Alcon Research Institute Senior Investigator Award. 
The views expressed in this publication are those of the authors and not necessarily those of the abovementioned funding bodies. We also thank Dr Siegfried K Wagner for his comments on a previous version of the manuscript.\u0026nbsp;\u003c/p\u003e\n\u003ch3\u003eCompeting interests\u003c/h3\u003e\n\u003cp\u003eFA is an equity owner in SIMA Surgical Intelligence Inc. PAK is a cofounder of Cascader Ltd. and has acted as a consultant for Retina Consultants of America, Roche, Boehringer Ingelheim, and Bitfount, and is an equity owner in Big Picture Medical. He has received speaker fees from Zeiss, Thea, Apellis, and Roche. He has received travel support from Bayer and Roche. He has attended advisory boards for Topcon, Bayer, Boehringer Ingelheim, and Roche. None of the other authors report any conflicts of interest.\u0026nbsp;\u003c/p\u003e\n\u003ch3\u003eCode availability\u003c/h3\u003e\n\u003cp\u003eThe code used in this study is not publicly available. It was developed for use within a secure clinical computing environment and contains components specific to local data structures and governance requirements. The code may be made available for academic review upon reasonable request, subject to institutional approval.\u003c/p\u003e\n\u003ch3\u003eData availability\u003c/h3\u003e\n\u003cp\u003eThe source data consist of routinely collected clinical letters containing sensitive patient information and cannot be shared publicly due to information governance constraints. All summary data, aggregated results, and analyses required to interpret the findings are provided in the manuscript and supplementary materials.\u003cbr\u003e\u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eJamieson T, Ailon J, Chien V, Mourad O. An electronic documentation system improves the quality of admission notes: a randomized trial. \u003cem\u003eJ Am Med Inform Assoc\u003c/em\u003e. 2017;24(1):123\u0026ndash;129. 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/jamia/ocw064\u003c/span\u003e\u003cspan address=\"10.1093/jamia/ocw064\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAmarasingham R, Plantinga L, Diener-West M, Gaskin DJ, Powe NR. Clinical information technologies and inpatient outcomes: a multiple hospital study. \u003cem\u003eArch Intern Med\u003c/em\u003e. 2009;169(2):108\u0026ndash;114. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1001/archinternmed.2008.520\u003c/span\u003e\u003cspan address=\"10.1001/archinternmed.2008.520\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHolmes JH, Beinlich J, Boland MR, et al. Why Is the Electronic Health Record So Challenging for Research and Clinical Care? \u003cem\u003eMethods Inf Med\u003c/em\u003e. 2021;60(1\u0026ndash;02):32\u0026ndash;48. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1055/s-0041-1731784\u003c/span\u003e\u003cspan address=\"10.1055/s-0041-1731784\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCowie MR, Blomster JI, Curtis LH, et al. Electronic health records to facilitate clinical research. \u003cem\u003eClin Res Cardiol\u003c/em\u003e. 2017;106(1):1\u0026ndash;9. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s00392-016-1025-6\u003c/span\u003e\u003cspan address=\"10.1007/s00392-016-1025-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBotsis T, Hartvigsen G, Chen F, Weng C. Secondary Use of EHR: Data Quality Issues and Informatics Opportunities. 
\u003cem\u003eSummit Transl Bioinform\u003c/em\u003e. 2010;2010:1\u0026ndash;5.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKong HJ. Managing Unstructured Big Data in Healthcare System. \u003cem\u003eHealthc Inform Res\u003c/em\u003e. 2019;25(1):1\u0026ndash;2. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.4258/hir.2019.25.1.1\u003c/span\u003e\u003cspan address=\"10.4258/hir.2019.25.1.1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWu H, Wang M, Wu J, et al. A survey on clinical natural language processing in the United Kingdom from 2007 to 2022. \u003cem\u003enpj Digit Med\u003c/em\u003e. 2022;5(1):1\u0026ndash;15. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41746-022-00730-6\u003c/span\u003e\u003cspan address=\"10.1038/s41746-022-00730-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eFu S, Chen D, He H, et al. Clinical concept extraction: A methodology review. \u003cem\u003eJournal of Biomedical Informatics\u003c/em\u003e. 2020;109:103526. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.jbi.2020.103526\u003c/span\u003e\u003cspan address=\"10.1016/j.jbi.2020.103526\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBazoge A, Morin E, Daille B, Gourraud PA. Applying Natural Language Processing to Textual Data From Clinical Data Warehouses: Systematic Review. \u003cem\u003eJMIR Medical Informatics\u003c/em\u003e. 2023;11(1):e42477. 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.2196/42477\u003c/span\u003e\u003cspan address=\"10.2196/42477\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNaveed H, Khan AU, Qiu S, et al. A Comprehensive Overview of Large Language Models. \u003cem\u003earXiv\u003c/em\u003e. Preprint posted online October 17, 2024:arXiv:2307.06435. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2307.06435\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2307.06435\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAyhan MS, Ong AY, Ruffell E, Wagner SK, Merle DA, Keane PA. In-context learning for data-efficient classification of diabetic retinopathy with multimodal foundation models. \u003cem\u003emedRxiv\u003c/em\u003e. Preprint posted online March 10, 2025:\u003cdiv class=\"ExternalRefDOI\"\u003e2025.03.09.25323618\u003c/div\u003e. doi:10.1101/2025.03.09.25323618\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eHein D, Christie A, Holcomb M, et al. Iterative refinement and goal articulation to optimize large language models for clinical information extraction. \u003cem\u003enpj Digit Med\u003c/em\u003e. 2025;8(1):301. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41746-025-01686-z\u003c/span\u003e\u003cspan address=\"10.1038/s41746-025-01686-z\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWihl J, Rosenkranz E, Schramm S, et al. Data extraction from free-text stroke CT reports using GPT-4o and Llama-3.3-70B: the impact of annotation guidelines. \u003cem\u003eEuropean Radiology Experimental\u003c/em\u003e. 2025;9(1):61. 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s41747-025-00600-2\u003c/span\u003e\u003cspan address=\"10.1186/s41747-025-00600-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWiest IC, Ferber D, Zhu J, et al. Privacy-preserving large language models for structured medical information retrieval. \u003cem\u003enpj Digit Med\u003c/em\u003e. 2024;7(1):257. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41746-024-01233-2\u003c/span\u003e\u003cspan address=\"10.1038/s41746-024-01233-2\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWiest IC, Verhees FG, Ferber D, et al. Detection of suicidality from medical text using privacy-preserving large language models. \u003cem\u003eBr J Psychiatry\u003c/em\u003e. 225(6):532\u0026ndash;537. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1192/bjp.2024.134\u003c/span\u003e\u003cspan address=\"10.1192/bjp.2024.134\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNHS Digital. Hospital Outpatient Activity 2019-20. NHS Digital. 2020. Accessed May 27, 2021. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://digital.nhs.uk/data-and-information/publications/statistical/hospital-outpatient-activity/2019-20/summary-report---treatment-specialities\u003c/span\u003e\u003cspan address=\"https://digital.nhs.uk/data-and-information/publications/statistical/hospital-outpatient-activity/2019-20/summary-report---treatment-specialities\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRadell JE, Tatum JN, Lin CT, et al. 
Risks and rewards of increasing patient access to medical records in clinical ophthalmology using OpenNotes. \u003cem\u003eEye\u003c/em\u003e. 2022;36(10):1951\u0026ndash;1958. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41433-021-01775-9\u003c/span\u003e\u003cspan address=\"10.1038/s41433-021-01775-9\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDe Fauw J, Ledsam JR, Romera-Paredes B, et al. Clinically applicable deep learning for diagnosis and referral in retinal disease. \u003cem\u003eNat Med\u003c/em\u003e. 2018;24(9):9. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41591-018-0107-6\u003c/span\u003e\u003cspan address=\"10.1038/s41591-018-0107-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eKraljevic Z, Shek A, Yeung JA, et al. Validating Transformers for Redaction of Text from Electronic Health Records in Real-World Healthcare. In: \u003cem\u003e2023 IEEE 11th International Conference on Healthcare Informatics (ICHI)\u003c/em\u003e. 2023:544\u0026ndash;549. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1109/ICHI57859.2023.00098\u003c/span\u003e\u003cspan address=\"10.1109/ICHI57859.2023.00098\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eInformation Commissioner\u0026rsquo;s Office. What is personal data? November 27, 2024. Accessed December 19, 2024. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/\u003c/span\u003e\u003cspan address=\"https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eRobertson S, Zaragoza H. The Probabilistic Relevance Framework: BM25 and Beyond. \u003cem\u003eINR\u003c/em\u003e. 2009;3(4):333\u0026ndash;389. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1561/1500000019\u003c/span\u003e\u003cspan address=\"10.1561/1500000019\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eJin Q, Kim W, Chen Q, et al. MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. \u003cem\u003eBioinformatics\u003c/em\u003e. 2023;39(11):btad651. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1093/bioinformatics/btad651\u003c/span\u003e\u003cspan address=\"10.1093/bioinformatics/btad651\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eCormack GV, Clarke CLA, Buettcher S. Reciprocal rank fusion outperforms condorcet and individual rank learning methods. \u003cem\u003eProceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval\u003c/em\u003e. Published online July 19, 2009:758\u0026ndash;759. 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1145/1571941.1572114\u003c/span\u003e\u003cspan address=\"10.1145/1571941.1572114\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBraun V, Clarke V. Thematic analysis. In: \u003cem\u003eAPA Handbook of Research Methods in Psychology, Vol 2: Research Designs: Quantitative, Qualitative, Neuropsychological, and Biological\u003c/em\u003e. APA handbooks in psychology\u0026reg;. American Psychological Association; 2012:57\u0026ndash;71. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1037/13620-004\u003c/span\u003e\u003cspan address=\"10.1037/13620-004\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eContinuing to bring you our latest models, with an improved Gemini 2.5 Flash and Flash-Lite release- Google Developers Blog. Accessed December 7, 2025. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/\u003c/span\u003e\u003cspan address=\"https://developers.googleblog.com/en/continuing-to-bring-you-our-latest-models-with-an-improved-gemini-2-5-flash-and-flash-lite-release/\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eollama/ollama. Published online October 29, 2025. Accessed October 30, 2025. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://github.com/ollama/ollama\u003c/span\u003e\u003cspan address=\"https://github.com/ollama/ollama\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eDettmers T, Zettlemoyer L. The case for 4-bit precision: k-bit inference scaling laws. In: \u003cem\u003eProceedings of the 40th International Conference on Machine Learning\u003c/em\u003e. Vol 202. ICML\u0026rsquo;23. JMLR.org; 2023:7750\u0026ndash;7774.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMcHugh ML. Interrater reliability: the kappa statistic. \u003cem\u003eBiochem Med (Zagreb)\u003c/em\u003e. 2012;22(3):276\u0026ndash;282.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eSingh S. Natural Language Processing for Information Extraction. \u003cem\u003earXiv\u003c/em\u003e. Preprint posted online July 6, 2018:arXiv:1807.02383. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.1807.02383\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.1807.02383\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eReichenpfader D, M\u0026uuml;ller H, Denecke K. A scoping review of large language model based approaches for information extraction from radiology reports. \u003cem\u003enpj Digit Med\u003c/em\u003e. 2024;7(1):1\u0026ndash;12. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41746-024-01219-0\u003c/span\u003e\u003cspan address=\"10.1038/s41746-024-01219-0\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLosch N, Plagwitz L, B\u0026uuml;scher A, Varghese J. 
Fine-Tuning LLMs on Small Medical Datasets: Text Classification and Normalization Effectiveness on Cardiology reports and Discharge records. \u003cem\u003earXiv\u003c/em\u003e. Preprint posted online March 27, 2025:arXiv:2503.21349. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2503.21349\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2503.21349\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAkbasli IT, Birbilen AZ, Teksam O. Leveraging large language models to mimic domain expert labeling in unstructured text-based electronic healthcare records in non-english languages. \u003cem\u003eBMC Med Inform Decis Mak\u003c/em\u003e. 2025;25(1):154. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1186/s12911-025-02871-6\u003c/span\u003e\u003cspan address=\"10.1186/s12911-025-02871-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNtinopoulos V, Rodriguez Cetina Biefer H, Tudorache I, et al. Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation. \u003cem\u003eBMJ Health Care Inform\u003c/em\u003e. 2025;32(1):e101139. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1136/bmjhci-2024-101139\u003c/span\u003e\u003cspan address=\"10.1136/bmjhci-2024-101139\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLiu NF, Lin K, Hewitt J, et al. Lost in the Middle: How Language Models Use Long Contexts. \u003cem\u003earXiv\u003c/em\u003e. Preprint posted online November 20, 2023:arXiv:2307.03172. 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2307.03172\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2307.03172\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMao J, Middleton SE, Niranjan M. Do prompt positions really matter? \u003cem\u003earXiv\u003c/em\u003e. Preprint posted online June 28, 2024:arXiv:2305.14493. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2305.14493\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2305.14493\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eLong WJ. Parsing Free Text Nursing Notes. \u003cem\u003eAMIA Annu Symp Proc\u003c/em\u003e. 2003;2003:917.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003ePanch T, Pollard TJ, Mattie H, Lindemer E, Keane PA, Celi LA. \u0026ldquo;Yes, but will it work for my patients?\u0026rdquo; Driving clinically relevant research with benchmark datasets. \u003cem\u003enpj Digit Med\u003c/em\u003e. 2020;3(1):87. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41746-020-0295-6\u003c/span\u003e\u003cspan address=\"10.1038/s41746-020-0295-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eOng AY, Naughton A, Hornby S, Shwe-Tin A. Impact of an email advice service on filtering and refining ophthalmology referrals in England. \u003cem\u003eInt Ophthalmol\u003c/em\u003e. 2023;43(11):4019\u0026ndash;4025. 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1007/s10792-023-02806-y\u003c/span\u003e\u003cspan address=\"10.1007/s10792-023-02806-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eStunkel L, Sharma RA, Mackay DD, et al. Patient Harm Due to Diagnostic Error of Neuro-Ophthalmologic Conditions. \u003cem\u003eOphthalmology\u003c/em\u003e. 2021;128(9):1356\u0026ndash;1362. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1016/j.ophtha.2021.03.008\u003c/span\u003e\u003cspan address=\"10.1016/j.ophtha.2021.03.008\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen BY, Antaki F, Gonzalez M, et al. Automated Identification of Stroke Thrombolysis Contraindications from Synthetic Clinical Notes: A Proof-of-Concept Study. \u003cem\u003eCerebrovasc Dis Extra\u003c/em\u003e. 2025;15(1):130\u0026ndash;136. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1159/000545317\u003c/span\u003e\u003cspan address=\"10.1159/000545317\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eMatheny ME, Yang J, Smith JC, et al. Enhancing Postmarketing Surveillance of Medical Products With Large Language Models. \u003cem\u003eJAMA Netw Open\u003c/em\u003e. 2024;7(8):e2428276. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1001/jamanetworkopen.2024.28276\u003c/span\u003e\u003cspan address=\"10.1001/jamanetworkopen.2024.28276\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eBalendran A, Benchoufi M, Evgeniou T, Ravaud P. Algorithmovigilance, lessons from pharmacovigilance. 
\u003cem\u003enpj Digit Med\u003c/em\u003e. 2024;7(1):270. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s41746-024-01237-y\u003c/span\u003e\u003cspan address=\"10.1038/s41746-024-01237-y\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eWeck OLD. Multiobjective optimisation: history and promise. Published online 2004.\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eAntaki F, Mikhail D, Milad D, et al. Performance of GPT-5 Frontier Models in Ophthalmology Question Answering. \u003cem\u003earXiv\u003c/em\u003e. Preprint posted online August 13, 2025:arXiv:2508.09956. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2508.09956\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2508.09956\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eChen L, Zaharia M, Zou J. How is ChatGPT\u0026rsquo;s behavior changing over time? \u003cem\u003earXiv\u003c/em\u003e. Preprint posted online October 31, 2023:arXiv:2307.09009. doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.48550/arXiv.2307.09009\u003c/span\u003e\u003cspan address=\"10.48550/arXiv.2307.09009\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e \u003cli\u003e\u003cspan\u003eNature Machine Intelligence. What is in your LLM-based framework? \u003cem\u003eNat Mach Intell\u003c/em\u003e. 2024;6(8):845\u0026ndash;845. 
doi:\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003e10.1038/s42256-024-00896-6\u003c/span\u003e\u003cspan address=\"10.1038/s42256-024-00896-6\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-8921439/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8921439/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eFree-text clinical records represent an untapped wealth of data for secondary use. Their potential is limited by resource demands necessary for accurate information extraction at scale. We introduce a scalable, resource-efficient, and high-performance pipeline which leverages large language models (LLMs) to address these challenges. This was developed and tested using real-world dual specialist-annotated ophthalmic clinical letters. Our pipeline achieved strong performance with a proprietary model in the development phase, yielding a maximum micro-averaged F1 score of 0.954 (95% CI 0.941\u0026ndash;0.967) for diagnosis across nine conditions through iterative prompt refinement alone, also demonstrating strong generalisability (micro-F1 ranging from 0.945\u0026ndash;0.980) in temporal validation. This approach extended to two other proprietary models in the same family and was tested in 17 local models from seven open-weight LLM families, demonstrating robustness against model choice and deployment constraints (for models \u0026gt;\u0026thinsp;10B parameters). 
Beyond performance, we develop a multi-dimensional assessment to evaluate LLMs for deployment in data extraction tasks, including introducing an error taxonomy to classify failure modes and implementing Pareto frontier analyses to systematically map the operational trade-offs (costs, time) across various LLM configurations. A robust approach to operationalising LLMs in real-world workflows at scale may help lay the foundation for next-generation data pipelines that can accelerate scientific discovery and power continuous learning health systems.\u003c/p\u003e","manuscriptTitle":"Developing a scalable pipeline for data extraction from clinical letters through resource-efficient prompt engineering","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2026-03-10 17:35:14","doi":"10.21203/rs.3.rs-8921439/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"
[email protected]","identity":"npj-digital-medicine","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"npjdigitalmed","sideBox":"Learn more about [npj Digital Medicine](http://www.nature.com/npjdigitalmed/)","snPcode":"41746","submissionUrl":"https://submission.springernature.com/new-submission/41746/3","title":"npj Digital Medicine","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"b177c4ff-1007-420c-9688-2c924860f40f","owner":[],"postedDate":"March 10th, 2026","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":63937269,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":63937270,"name":"Physical sciences/Engineering"},{"id":63937271,"name":"Physical sciences/Mathematics and computing"}],"tags":[],"updatedAt":"2026-04-17T21:53:14+00:00","versionOfRecord":[],"versionCreatedAt":"2026-03-10 17:35:14","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8921439","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8921439","identity":"rs-8921439","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}