Prompt Engineering in Large Language Models for BI-RADS Classification of Imaging Reports: A Retrospective Evaluation

2025 · doi:10.21203/rs.3.rs-7526460/v1

preprint OA: closed

Research Article

Wenjie Liu, Hailong Wu, Yuanyuan Lang, Yan Luo, Yan Li, Xinyi Liu, and 2 more

This is a preprint; it has not been peer reviewed by a journal.

https://doi.org/10.21203/rs.3.rs-7526460/v1

This work is licensed under a CC BY 4.0 License.

Status: Under Review. Version 1.

Abstract

Purpose

To evaluate how prompt engineering modulates large language models' (LLMs) accuracy in Breast Imaging Reporting and Data System (BI-RADS) classification of digital breast tomosynthesis (DBT) reports.

Materials and Methods

This retrospective study collected reports from 216 patients who underwent DBT for breast cancer screening or diagnosis. BI-RADS classifications were independently assigned to all reports by two experts.
Three LLMs (GPT-4o, GPT-o3 mini, Qwen-2.5 max) were used to classify all reports using different prompts. In addition, six human readers independently assigned BI-RADS classifications. Agreement between experts and LLMs for BI-RADS categories was evaluated using weighted Cohen's kappa (κw). Friedman and Nemenyi tests assessed κw differences among the three prompt conditions. The frequencies of changed BI-RADS category assignments that could affect clinical management were also calculated.

Results

With Prompt III, GPT-4o achieved near-perfect agreement with experts (κw = 0.80), surpassing GPT-o3 mini (0.76) and Qwen-2.5 max (0.79). Its κw was significantly higher with Prompt III than with Prompt II (0.69, P < 0.05) and Prompt I (0.63, P < 0.01). While GPT-4o's κw remained lower than that of the two mid-level radiologists (0.89 and 0.86), it exceeded that of the two entry-level radiologists (0.76 and 0.79). Regarding clinical management changes, Prompt III yielded a 14.8% discordance rate with experts, outperforming Prompts I (29.6%) and II (28.2%), and aligning with entry-level radiologists (15.3%, 14.4%).

Conclusion

With optimized prompts, GPT-4o achieved near-perfect agreement and matched the clinical management performance of entry-level radiologists. These findings support the use of LLMs as an auxiliary tool for BI-RADS classification in breast cancer diagnosis by radiologists.

Summary Statement

Optimized prompting enabled the best-performing model, GPT-4o, to achieve near-perfect BI-RADS agreement with experts on DBT reports and a low clinical management misclassification rate, indicating LLMs' potential as auxiliary tools for radiologists.

Key Results

With Prompt III, GPT-4o demonstrated near-perfect agreement with experts (κw, 0.80), outperforming both GPT-o3 mini (0.76) and Qwen (0.79). Notably, the κw of GPT-4o with Prompt III was significantly higher than that observed with Prompt II (0.69, P < 0.05) and Prompt I (0.63, P < 0.01).
Prompt optimization reduced the proportion of BI-RADS category assignments that would negatively change clinical management, bringing GPT-4o in line with entry-level radiologists (GPT-4o: Prompt I, 29.6%; Prompt II, 28.2%; Prompt III, 14.8%; entry-level radiologists: 15.3% and 14.4%).

Introduction

In breast cancer screening and management, the Breast Imaging Reporting and Data System (BI-RADS) plays a critical role in guiding clinical decision-making, treatment selection, and prognostic assessment (1, 2). However, accurate BI-RADS classification requires extensive clinical expertise, often leading to interobserver variability among radiologists with varying levels of experience (3). Furthermore, the BI-RADS classification criteria vary across imaging modalities, including mammography, breast ultrasound, and breast magnetic resonance imaging, further complicating the grading process (4). These challenges in standardization and consistency emphasize the need for a more efficient solution. Large language models (LLMs) have recently emerged as a promising approach to address these limitations (5, 6).

LLMs are large-scale artificial intelligence systems trained through deep learning techniques, capable of understanding, processing, and generating natural language (7, 8). Their rapid development has attracted significant interest within the healthcare sector, where their potential applications are increasingly recognized (9–11). Models such as the GPT series exemplify LLMs' emergence as powerful tools in critical healthcare processes, including medical diagnosis, treatment planning, and patient management, owing to their robust text generation and comprehension capabilities (12–16).
Next-generation LLMs, such as GPT-4o, demonstrate enhanced precision and consistency, particularly in semantic understanding, reasoning, and specialized applications (17).

Researchers have begun to explore the potential of LLMs for BI-RADS classification from free-text radiology reports (5, 6). Although pre-trained LLMs exhibit some accuracy gaps compared to expert radiologists, they have demonstrated significant potential in improving classification consistency and reducing inter-observer variability (18). Prompt engineering has emerged as a key technique for enhancing LLM performance (19, 20); however, its effectiveness in BI-RADS classification has not been comprehensively explored.

This study aimed to evaluate LLM performance in BI-RADS classification of free-text digital breast tomosynthesis (DBT) reports, with an emphasis on improving classification accuracy and consistency through prompt optimization. Various prompt engineering strategies were explored to enhance the model's predictive capability for BI-RADS categorization and support AI-assisted radiologic diagnosis.

Materials and Methods

Study Design and Sample

This retrospective study received approval from the Institutional Review Board of the Second Affiliated Hospital of Nanchang University, which also waived the requirement for written informed consent. Radiology reports from adult women who underwent DBT examinations at the Second Affiliated Hospital of Nanchang University from January 2024 to December 2024 were collected. All reports were authored by radiologists who had undergone specialized training and were written exclusively in Chinese. Exclusion criteria were (1) incomplete reports, (2) post-surgical examinations, (3) patients aged under 18 years, (4) BI-RADS 0 and 6 classifications (as these require additional information not included in the imaging descriptions), and (5) BI-RADS 1 classifications (due to the small number of reports).
To ensure an equal distribution across BI-RADS categories 2–5, cases were selected in strict adherence to these exclusion criteria. Relevant clinical information, such as patient age, was also collected. At no point during the study were the images from the examinations evaluated. A flowchart detailing the patient selection process is shown in Fig. 1.

Report Processing and Evaluation

The clinical information, findings, impression, and BI-RADS category sections of all exported imaging reports were extracted into a spreadsheet for subsequent analysis. Two certified senior breast radiologists (with 28 and 30 years of post-training experience, respectively) were then invited to independently review these reports according to the 6th edition of the American College of Radiology BI-RADS classification system. Each radiologist assigned a BI-RADS category, which served as the reference. During the evaluation process, the two experts were provided only with the "Findings" section of the reports.

BI-RADS Category Assignment by LLMs

This study evaluated the performance of three LLMs in BI-RADS classification: GPT-4o (OpenAI, USA), GPT-o3 mini (OpenAI, USA), and Qwen-2.5 max (Alibaba, China), which are newer iterations of widely recognized conversational artificial intelligence applications. Each model was tested with three different prompts, with both the imaging findings and the prompts written in Chinese (Table S1). Prompt I included elements of role-playing, grading limitations, and specified task requirements. Prompt II, an extension of Prompt I, suggested assigning a higher grade when the BI-RADS classification was unclear. Prompt III further expanded on Prompt II by recommending special attention to seemingly benign lesions and careful consideration of whether the grade should be elevated, consistent with established clinical practice. The findings from the reports and the prompts were input into GPT-4o, GPT-o3 mini, and Qwen-2.5 max.
After each report was input and a response received, the chat session was restarted. The models were tested using a zero-shot approach (i.e., no example data was provided prior to testing). To assess reproducibility, 10 reports were randomly selected from each BI-RADS category and re-entered into the LLMs for evaluation after 7 days. The models were accessed between February 3, 2025, and February 20, 2025, during which the versions of the three models remained unchanged.

BI-RADS Category Assignment by Human Readers

Six human readers were invited and categorized by experience level: two in-training radiologists (both with 6 months of post-training experience and BI-RADS classification training), two entry-level breast radiologists (with 6 and 7 years of post-training experience, respectively), and two mid-level breast radiologists (with 12 and 10 years of post-training experience, respectively), hereafter referred to as resident 1, resident 2, entry-level radiologist 1, entry-level radiologist 2, mid-level radiologist 1, and mid-level radiologist 2. Each reader independently evaluated and assigned a BI-RADS classification to each report.

Statistical Analysis

All statistical analyses were performed using Python (version 3.9.13) with the following packages: Statsmodels (0.13.2), Seaborn (0.11.2), Matplotlib (3.5.2), and Scikit-learn (1.6.1). The weighted Cohen's kappa (κw) statistic was used to assess agreement across the ordered BI-RADS categories. The κw values and corresponding 95% confidence intervals (CIs) were calculated for the comparison between each LLM and the reference standard, as well as between each of the six human readers and the reference standard.
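As an illustration of the agreement statistic described above, the following is a minimal sketch of weighted Cohen's kappa for ordinal BI-RADS categories, written with the standard library only. The function name `weighted_kappa`, the example ratings, and the choice of linear weights are assumptions made for illustration; the text does not state which weighting scheme was used. Scikit-learn's `cohen_kappa_score` with `weights="linear"` or `weights="quadratic"` is expected to give the same value.

```python
def weighted_kappa(reference, predicted, labels=(2, 3, 4, 5), weighting="linear"):
    """Weighted Cohen's kappa between two raters over ordered categories.

    Hypothetical helper for illustration; the weighting scheme is an assumption.
    """
    if len(reference) != len(predicted):
        raise ValueError("both raters must score the same reports")
    n = len(reference)
    index = {label: i for i, label in enumerate(labels)}
    k = len(labels)

    # Observed confusion matrix: rows = reference, columns = second rater.
    observed = [[0] * k for _ in range(k)]
    for ref, pred in zip(reference, predicted):
        observed[index[ref]][index[pred]] += 1

    row_totals = [sum(observed[i]) for i in range(k)]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        # Disagreement weight grows with the distance between categories.
        distance = abs(i - j)
        return distance if weighting == "linear" else distance ** 2

    # Weighted observed disagreement versus disagreement expected by chance.
    # Note: the chance term is zero if either rater uses a single category.
    observed_disagreement = sum(
        weight(i, j) * observed[i][j] for i in range(k) for j in range(k)
    )
    expected_disagreement = sum(
        weight(i, j) * row_totals[i] * col_totals[j] / n
        for i in range(k)
        for j in range(k)
    )
    return 1.0 - observed_disagreement / expected_disagreement


# Made-up ratings for eight reports (not study data).
reference = [2, 2, 3, 3, 4, 4, 5, 5]
predicted = [2, 2, 2, 3, 4, 3, 4, 5]
kappa = weighted_kappa(reference, predicted)
```

Because the weights scale with the distance between categories, confusing BI-RADS 5 with 4 is penalized less than confusing BI-RADS 4 with 2, which is why a weighted statistic suits ordinal grades.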
The strength of inter-rater agreement was interpreted as follows: κw < 0.00 indicates poor agreement, 0.00–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement (21). The significance of the kappa value was assessed using a two-tailed z-test based on asymptotic normal distribution theory. To evaluate differences between prompt conditions, the Friedman test (22) was employed to assess overall differences in κw values across the three prompt conditions, followed by a Nemenyi post-hoc test (22) to identify which pairs of conditions differed. A P value < 0.05 was considered statistically significant.

In addition, the assigned BI-RADS categories were grouped according to their associated clinical management pathways: BI-RADS 2 (normal or benign; no intervention required), BI-RADS 3 (probably benign; short-term follow-up), and BI-RADS 4 or 5 (suspicious or highly suggestive of malignancy; biopsy recommended). To determine whether reclassification (downgrade or upgrade) by the six human readers or LLMs would have a negative impact on clinical management, the management pathway implied by the reference standard was taken as the correct one.

Results

Study Sample

A total of 216 imaging reports were included in the study, with 54 reports allocated to each BI-RADS category. Table 1 summarizes the characteristics of the 216 female patients and their corresponding radiology reports. The mean age of the patients was 50 ± 10 years. All 216 DBT imaging reports were written entirely in Chinese, and the extracted report sections contained an average of 103.24 words.

Table 1 Baseline characteristics of patients.

| Characteristics | Value |
| --- | --- |
| Sex (n = 216) | Female, 100% |
| Age (years) | 50.48 ± 10.07 |
| No. of words in extracted portions of report | 103.24 ± 31.44 |
| Report language (n = 216) | Chinese only, 100% |
| Examination type (n = 216) | DBT, 100% |
| Reference-standard BI-RADS classification based on radiology report (sixth edition) | |
| BI-RADS 2 | 54 (25%) |
| BI-RADS 3 | 54 (25%) |
| BI-RADS 4 | 54 (25%) |
| BI-RADS 5 | 54 (25%) |

Data are presented as mean ± standard deviation for normally distributed continuous variables, or number (%) for categorical variables. DBT = digital breast tomosynthesis. BI-RADS = Breast Imaging-Reporting and Data System.

Evaluation of BI-RADS Classification of LLMs Across Different Prompts

Figure 2 illustrates the performance of the three LLMs in BI-RADS classification across different prompts. With Prompt III, GPT-4o demonstrated near-perfect agreement (κw, 0.80), outperforming both GPT-o3 mini (0.76) and Qwen (0.79). Notably, the κw of GPT-4o with Prompt III was significantly higher than that observed with Prompt II (0.69, P < 0.05) and Prompt I (0.63, P < 0.01).

Figure 3 displays the confusion matrices of the LLMs in BI-RADS classification across different prompts. With Prompt II (which added an escalation rule for ambiguous cases on top of Prompt I), all LLMs significantly reduced the misclassification of BI-RADS 5 as 4. Specifically, GPT-4o, Qwen-2.5 max, and GPT-o3 mini showed decreases in misclassified cases of 54% (28 to 13), 50% (12 to 6), and 68% (37 to 12), respectively. However, misclassification of BI-RADS 3 as 2 and of BI-RADS 4 as 3 or 2 did not improve significantly. With Prompt III (which added special attention to lesions with a low probability of malignancy on top of Prompt II), all LLMs not only reduced misclassification of BI-RADS 5 as 4 but also reduced misclassification of BI-RADS 3 as 2. Specifically, GPT-4o, Qwen-2.5 max, and GPT-o3 mini showed reductions in such misclassifications of 61% (from 43 to 17), 65% (from 43 to 15), and 63% (from 46 to 17), respectively.
In addition, the misclassification of BI-RADS 4 as 2 was nearly eliminated (GPT-4o: from 11 to 0; Qwen-2.5 max: from 10 to 0; GPT-o3 mini: from 9 to 1).

For each prompt, there were no significant differences in κw between the initial BI-RADS classifications and those performed 7 days later for any of the three LLMs. The κw values were nearly perfect with Prompt III (GPT-4o, 0.84; GPT-o3 mini, 0.85; Qwen-2.5 max, 0.90) (Table S2). Prompt II showed substantial to nearly perfect κw values (0.74–0.84), while Prompt I exhibited moderate to nearly perfect agreement (0.68–0.83).

Evaluation of BI-RADS Classification: LLMs-Human and Human-Human

Figure 4 illustrates the performance of the LLMs and six physicians with varying levels of experience in BI-RADS classification. The κw values of the two mid-level radiologists (0.90 and 0.86) were significantly higher than those of the LLMs under Prompt III. GPT-4o with Prompt III (0.80) achieved a high κw, slightly surpassing entry-level radiologist 1 (0.76) and comparable to entry-level radiologist 2 (0.79), while being notably superior to the two residents (0.54 and 0.61).

LLMs-Human and Human-Human Agreement according to Clinical Management

Tables 2 and 3 and Table S3 summarize the performance of the LLMs across different prompts regarding clinical management categories. The proportions of changed clinical management were comparable among the LLMs (with GPT-4o demonstrating the best overall performance). However, the proportion of changed clinical management with Prompt III was significantly lower than with Prompts I and II. Table 4 summarizes the performance of the six human readers in clinical management categories, with the best performer at each experience level being resident 2, entry-level radiologist 2, and mid-level radiologist 1.
The reference standards included cases requiring no treatment (BI-RADS 2, n = 54), follow-up in 6 months (BI-RADS 3, n = 54), and biopsy or aspiration (BI-RADS 4 or 5, n = 108). Figure 5 illustrates the potential changes in clinical management between LLMs and human interpretations.

Table 2 Agreement between LLMs in Prompt I and the reference standard on clinical management reports.

| Outcome | Standard-GPT-4o | Standard-GPT-o3 | Standard-Qwen |
| --- | --- | --- | --- |
| Weighted Cohen's kappa (95% CIs) | 0.63 (0.57, 0.69) | 0.60 (0.54, 0.66) | 0.65 (0.59, 0.71) |
| Changes in management | 64/216 (29.6%) | 65/216 (30.1%) | 69/216 (31.9%) |
| Upgraded | 3/216 (1.4%) | 2/216 (0.9%) | 5/216 (2.3%) |
| From BI-RADS 2 to BI-RADS 3 | 2/3 (66.7%) | 1/2 (50.0%) | 2/5 (40.0%) |
| From BI-RADS 2/3 to BI-RADS 4/5 | 1/3 (33.3%) | 1/2 (50.0%) | 3/5 (60.0%) |
| Downgraded | 61/216 (28.2%) | 63/216 (29.2%) | 64/216 (29.6%) |
| From BI-RADS 3 to BI-RADS 2 | 43/61 (70.5%) | 46/63 (73.0%) | 43/64 (67.2%) |
| From BI-RADS 4/5 to BI-RADS 2 | 11/61 (18.0%) | 9/63 (14.3%) | 10/64 (15.6%) |
| From BI-RADS 4/5 to BI-RADS 3 | 7/61 (11.5%) | 8/63 (12.7%) | 11/64 (17.2%) |

Data are expressed as a numerator and a denominator with the percentage in parentheses. LLM = large language model. GPT-o3 = GPT-o3 mini. Qwen = Qwen-2.5 max. CIs = confidence intervals. BI-RADS = Breast Imaging Reporting and Data System.

Table 3 Agreement between LLMs in Prompt III and the reference standard on clinical management reports.

| Outcome | Standard-GPT-4o | Standard-GPT-o3 | Standard-Qwen |
| --- | --- | --- | --- |
| Weighted Cohen's kappa (95% CIs) | 0.80 (0.75, 0.85) | 0.76 (0.71, 0.82) | 0.79 (0.74, 0.84) |
| Changes in management | 32/216 (14.8%) | 41/216 (19.0%) | 35/216 (16.2%) |
| Upgraded | 2/216 (0.9%) | 2/216 (0.9%) | 7/216 (3.2%) |
| From BI-RADS 2 to BI-RADS 3 | 2/2 (100%) | 2/2 (100%) | 6/7 (85.7%) |
| From BI-RADS 2/3 to BI-RADS 4/5 | 0/2 (0%) | 0/2 (0%) | 1/7 (14.3%) |
| Downgraded | 30/216 (13.9%) | 39/216 (18.1%) | 28/216 (13.0%) |
| From BI-RADS 3 to BI-RADS 2 | 17/30 (56.7%) | 17/39 (43.6%) | 15/28 (53.6%) |
| From BI-RADS 4/5 to BI-RADS 2 | 0/30 (0%) | 1/39 (2.6%) | 0/28 (0%) |
| From BI-RADS 4/5 to BI-RADS 3 | 13/30 (43.3%) | 21/39 (53.8%) | 13/28 (46.4%) |

Data are expressed as a numerator and a denominator with the percentage in parentheses. LLM = large language model. GPT-o3 = GPT-o3 mini. Qwen = Qwen-2.5 max. CIs = confidence intervals. BI-RADS = Breast Imaging Reporting and Data System.

Table 4 Agreement between human readers and the reference standard on clinical management reports.

| Outcome | Standard-Resident 1 | Standard-Resident 2 | Standard-Entry-level 1 | Standard-Entry-level 2 | Standard-Mid-level 1 | Standard-Mid-level 2 |
| --- | --- | --- | --- | --- | --- | --- |
| Weighted Cohen's kappa (95% CIs) | 0.54 (0.47, 0.61) | 0.61 (0.54, 0.67) | 0.76 (0.71, 0.81) | 0.79 (0.74, 0.84) | 0.90 (0.86, 0.93) | 0.86 (0.82, 0.90) |
| Changes in management | 76/216 (35.2%) | 55/216 (25.5%) | 33/216 (15.3%) | 31/216 (14.4%) | 11/216 (5.1%) | 23/216 (10.6%) |
| Upgraded | 9/216 (4.2%) | 9/216 (4.2%) | 9/216 (4.2%) | 8/216 (3.7%) | 3/216 (1.4%) | 6/216 (2.8%) |
| From BI-RADS 2 to BI-RADS 3 | 8/9 (88.9%) | 5/9 (55.6%) | 7/9 (77.8%) | 6/8 (75.0%) | 3/3 (100%) | 5/6 (83.3%) |
| From BI-RADS 2/3 to BI-RADS 4/5 | 1/9 (11.1%) | 4/9 (44.4%) | 2/9 (22.2%) | 2/8 (25.0%) | 0/3 (0%) | 1/6 (16.7%) |
| Downgraded | 67/216 (31.0%) | 46/216 (21.3%) | 24/216 (11.1%) | 23/216 (10.6%) | 8/216 (3.7%) | 17/216 (7.9%) |
| From BI-RADS 3 to BI-RADS 2 | 15/67 (22.4%) | 13/46 (28.3%) | 10/24 (41.7%) | 11/23 (47.8%) | 7/8 (87.5%) | 11/17 (64.7%) |
| From BI-RADS 4/5 to BI-RADS 2 | 1/67 (1.5%) | 0/46 (0%) | 0/24 (0%) | 0/23 (0%) | 0/8 (0%) | 0/17 (0%) |
| From BI-RADS 4/5 to BI-RADS 3 | 51/67 (76.1%) | 33/46 (71.7%) | 14/24 (58.3%) | 12/23 (52.2%) | 1/8 (12.5%) | 6/17 (35.3%) |

Data are expressed as a numerator and a denominator with the percentage in parentheses. Entry-level = Entry-level doctor; Mid-level = Mid-level doctor. CIs = confidence intervals. BI-RADS = Breast Imaging Reporting and Data System.

When the BI-RADS classifications from both the LLM and the reference standard lead to the same clinical decision (e.g., both BI-RADS 4/5 recommend biopsy), no change in management is required. For underclassification cases, where BI-RADS categories were inappropriately downgraded (e.g., from BI-RADS 4 or 5 to follow-up categories, or from BI-RADS 3, 4, 5 to no intervention), the underclassification rate for GPT-4o in Prompt III was 13.9% (n = 30/216), which was significantly lower than for Prompt I (28.2% [n = 61/216]) and Prompt II (27.8% [n = 60/216]). Entry-level radiologist 2 and mid-level radiologist 1 showed significantly lower underclassification rates than GPT-4o in all prompts (10.6% [n = 23/216] and 3.7% [n = 8/216], respectively), while resident 2 had a higher underclassification rate (21.3% [n = 46/216]) than GPT-4o in Prompt III. The underclassification by GPT-4o under Prompt I and Prompt II mainly involved downgrading from follow-up to no intervention (70.5% [n = 43/61] and 80.0% [n = 48/60], respectively), whereas resident 1 primarily showed downgrading from biopsy-required categories to follow-up (76.1% [n = 51/67]).

Conversely, for overclassification cases, where categories were inappropriately upgraded (e.g., from BI-RADS 2 to follow-up, or from BI-RADS 2, 3 to biopsy-required categories), the overclassification rate for GPT-4o in Prompt III was 0.9%, which was comparable to Prompt I (1.4%) and Prompt II (0.5%). Notably, all these overclassification rates were lower than those observed for the three human readers (4.2%, 3.7%, and 1.4%, respectively).
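The management-change counts reported above can be reproduced with a short sketch along the following lines. It assumes the three-pathway grouping given in the Statistical Analysis section (BI-RADS 2, BI-RADS 3, BI-RADS 4 or 5); the function names `management_level` and `management_changes` and the example ratings are hypothetical and are not taken from the study's code.

```python
def management_level(birads):
    """Map a BI-RADS category to an ordered clinical management pathway."""
    if birads == 2:
        return 0  # no intervention
    if birads == 3:
        return 1  # short-term follow-up
    if birads in (4, 5):
        return 2  # biopsy recommended
    raise ValueError(f"BI-RADS {birads} is outside the 2-5 range used here")


def management_changes(reference, assigned):
    """Count assignments that would change management relative to the reference."""
    if len(reference) != len(assigned):
        raise ValueError("both raters must score the same reports")
    upgraded = downgraded = 0
    for ref, got in zip(reference, assigned):
        ref_level, got_level = management_level(ref), management_level(got)
        if got_level > ref_level:
            upgraded += 1      # overclassification
        elif got_level < ref_level:
            downgraded += 1    # underclassification
    changed = upgraded + downgraded
    return {
        "upgraded": upgraded,
        "downgraded": downgraded,
        "changed": changed,
        "rate": changed / len(reference),
    }


# Made-up ratings (not study data): a 4 vs 5 swap does not change management.
reference = [2, 3, 4, 5, 4]
assigned = [2, 2, 5, 4, 3]
summary = management_changes(reference, assigned)
```

Under this grouping, confusing BI-RADS 4 with 5 leaves management unchanged, which is why the change rate can be lower than the raw category error rate. As a check against Table 3, GPT-4o's 2 upgrades and 30 downgrades sum to 32 of 216 reports, or 14.8%.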
Discussion

This study represents the first systematic evaluation of LLM performance in BI-RADS classification of free-text DBT imaging reports under different prompt engineering strategies. Comparison with expert radiologists revealed that strategic prompting significantly improved the agreement of LLM-assigned BI-RADS categories with the reference standard. These findings highlight the potential of prompt engineering to enhance LLM performance in supporting BI-RADS classification, indicating that strategic prompt design may be a crucial tool for improving consistency and accuracy in breast imaging interpretation.

Recent advancements in LLMs have demonstrated significant progress in the medical field. For instance, in medical imaging, LLMs have demonstrated capabilities in assisting clinicians with the interpretation of radiological, pathological, and electrocardiographic images, thereby enhancing both efficiency and accuracy in diagnostic workflows (23–26). Furthermore, LLMs exhibit substantial potential in drug discovery, facilitating the identification of novel therapeutic targets and accelerating the development of new pharmaceutical agents (27, 28). Beyond data extraction from free-text reports (29), LLMs can also perform tumor staging and grading through unstructured text analysis, such as leveraging ChatGPT to stage lung cancer based on free-text imaging reports (30). Despite these promising advancements, particularly in managing complex medical tasks, critical challenges, including instability, opacity, and inaccuracy in LLM outputs, remain substantial barriers to their clinical implementation (31). In addition, current research underscores the necessity of benchmarking LLM performance against human expertise across specialized domains as more advanced models are introduced.
Compared to prior studies, the GPT-4o model in this study demonstrated near-perfect agreement with senior radiologists (GPT-4o, κw = 0.80), significantly surpassing the moderate agreement reported in earlier research (GPT-4, Gwet AC1 = 0.52). Furthermore, the proportion of clinical management changes induced by GPT-4o (14.8%) was notably lower than that observed in previous investigations using GPT-4 (18.1%) (5). The performance of GPT-4o exceeded that of residents and was comparable to that of entry-level radiologists with limited experience.

The advancements in this study may be attributed to three primary innovations. First, the dual-principle Prompt III design incorporated rule-based constraints aligned with American College of Radiology BI-RADS guidelines, reducing BI-RADS 5 to 4 misclassification rates by 53.6–67.6% across models through conservative upgrade principles (4). In addition, a low-heterogeneity attention mechanism mandated cross-validation of subtle imaging features, decreasing BI-RADS 3 to 2 misclassifications by 60.5–65.1% and eliminating 4 to 2 errors in GPT-4o and Qwen-2.5 max. This mechanism specifically addresses confirmation bias in the interpretation of low-risk lesions, a known cognitive pitfall in diagnostic decision-making. Second, GPT-4o's architectural and training data optimizations enhanced its capacity to process the complex linguistic tasks inherent to radiology reporting, outperforming earlier models. Third, restricting BI-RADS categorization outputs to categories 2 to 5 allowed the model to focus on ambiguous cases, improving classification accuracy.

Notably, while this study used κw to assess agreement, prior research relied on Gwet's AC1 statistic. Although both metrics evaluate classification consistency, κw weights ordinal misclassifications by their distance (e.g., adjacent-category errors are penalized less than distant ones), enabling a more comprehensive performance evaluation.
This distinction likely contributes to the higher agreement observed in our study.

Notably, the proportion of clinical management changes under GPT-4o with Prompt III was 14.8%, which was comparable to that of entry-level radiologists (15.3% and 14.4%), yet significantly higher than that of mid-level radiologists. This suggests that, as an assistive tool, LLMs can facilitate rapid onboarding for novice practitioners and hold potential for reducing workload among entry-level radiologists, while also highlighting the necessity for human oversight in high-risk decision-making scenarios.

With the global proliferation and immediate availability of LLMs, particularly in the context of the current shortage of healthcare professionals, patients may upload imaging reports (findings only) to LLMs for interpretation. The process employed in this study, involving zero-shot prompting and no exposure of LLMs to images, closely simulates this behavior. Such scenarios could raise significant ethical and legal concerns, particularly if patients rely solely on LLM-generated results, which could lead to misdiagnosis or treatment delays. Therefore, educating patients on the proper use of LLMs for interpreting imaging reports and on critically evaluating the generated results is becoming increasingly important.

This study has several limitations that should be acknowledged. First, the sample was derived from a single institution and all reports were in Chinese, which limits the generalizability of the findings to some extent. Future studies should incorporate diverse languages and institutions to validate the applicability and performance of LLMs in varied contexts. Second, while prompt engineering improved LLM performance in classification consistency, this study did not compare LLMs with other artificial intelligence models, such as convolutional neural networks.
Furthermore, this study primarily focused on the analysis of classification accuracy while neglecting factors critical in actual clinical settings, such as real-time performance and interpretability. Future research should place greater emphasis on optimizing LLM real-time feedback capabilities and addressing their transparency issues in clinical applications to enhance their operability and acceptance among clinicians.

In conclusion, the current study results indicate that although optimizing prompts can enhance the performance of LLMs in BI-RADS classification to some extent, their accuracy remains inferior to that of experienced clinicians. Given the critical role of BI-RADS classification in breast cancer diagnosis and management, LLMs optimized through prompt engineering are not yet capable of fully replacing the expertise of trained healthcare professionals. Therefore, future research should focus on developing a multimodal framework driven by LLMs (e.g., integrating image feature vectors as input to the LLM and modeling them through joint visual-text encoding) to facilitate deeper collaboration between artificial intelligence and clinical expertise, ultimately enhancing healthcare quality and improving patient outcomes.

Declarations

Ethics approval

The studies involving human participants were reviewed and approved by the Institutional Review Board of the Second Affiliated Hospital of Nanchang University. Since data were evaluated retrospectively and pseudonymously and were obtained solely for treatment purposes, the requirement of informed consent was waived by the Institutional Review Board. All authors have confirmed that any experiments involving humans and/or the use of human tissue samples were conducted in accordance with relevant guidelines and regulations.

Consent for publication

Not applicable.

Availability of data and materials

The datasets used and analysed during the current study are available from the corresponding author on reasonable request.
Competing Interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Funding
No funding.

Author Contributions
Conception and design: W.L. and Y.L. Administrative support: L.G. Collection and assembly of data: W.L. Data analysis and interpretation: W.L. Partial human readers: Y.L., Y.L., H.W., Y.L., W.L., X.L. Manuscript writing: W.L., Y.L., and L.G. Final approval of manuscript: L.G. All authors contributed to the article and approved the submitted version.

Acknowledgements
Not applicable.

References
1. Timmers JM, van Doorne-Nagtegaal HJ, Zonderland HM, et al. The Breast Imaging Reporting and Data System (BI-RADS) in the Dutch breast cancer screening programme: its role as an assessment and stratification tool. Eur Radiol 2012;22(8):1717-1723. doi: 10.1007/s00330-012-2409-2
2. Spak DA, Plaxco JS, Santiago L, Dryden MJ, Dogan BE. BI-RADS® fifth edition: A summary of changes. Diagn Interv Imaging 2017;98(3):179-190. doi: 10.1016/j.diii.2017.01.001
3. Ekpo EU, Ujong UP, Mello-Thoms C, McEntee MF. Assessment of Interradiologist Agreement Regarding Mammographic Breast Density Classification Using the Fifth Edition of the BI-RADS Atlas. AJR Am J Roentgenol 2016;206(5):1119-1123. doi: 10.2214/AJR.15.15049
4. American College of Radiology. ACR BI-RADS Atlas: Breast Imaging Reporting and Data System. 5th ed. Reston, VA: American College of Radiology, 2013.
5. Cozzi A, Pinker K, Hidber A, et al. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology 2024;311(1):e232133. doi: 10.1148/radiol.232133
6. Haver HL, Yi PH, Jeudy J, Bahl M. Use of ChatGPT to Assign BI-RADS Assessment Categories to Breast Imaging Reports. AJR Am J Roentgenol 2024;223(3):e2431093. doi: 10.2214/AJR.24.31093
7. Yang J, Jin H, Tang R, et al. Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. ACM Transactions on Knowledge Discovery from Data 2023;18:1-32. doi: 10.1145/3649506
8. Fan L, Li L, Ma Z, Lee S, Yu H, Hemphill L. A Bibliometric Review of Large Language Models Research from 2017 to 2023. ACM Transactions on Intelligent Systems and Technology 2023. doi: 10.48550/arXiv.2304.02020
9. Will ChatGPT transform healthcare? Nature Medicine 2023;29(3):505-506. doi: 10.1038/s41591-023-02289-5
10. Mukherjee P, Hou B, Lanfredi RB, Summers RM. Feasibility of Using the Privacy-preserving Large Language Model Vicuna for Labeling Radiology Reports. Radiology 2023;309(1):e231147. doi: 10.1148/radiol.231147
11. Adams LC, Truhn D, Busch F, et al. Leveraging GPT-4 for Post Hoc Transformation of Free-text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study. Radiology 2023;307(4):e230725. doi: 10.1148/radiol.230725
12. Liu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res 2023;25:e48568. doi: 10.2196/48568
13. Yeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023;29(3):721-732. doi: 10.3350/cmh.2023.0089
14. Rao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res 2023;25:e48659. doi: 10.2196/48659
15. Ozenbas C, Engin D, Altinok T, Akcay E, Aktas U, Tabanli A. ChatGPT-4o's Performance in Brain Tumor Diagnosis and MRI Findings: A Comparative Analysis with Radiologists. Acad Radiol 2025. doi: 10.1016/j.acra.2025.01.033
16. Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT's (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024;103(32):e39250. doi: 10.1097/MD.0000000000039250
17. Gallifant J, Fiske A, Levites Strekalova YA, et al. Peer review of GPT-4 technical report and systems card. PLOS Digit Health 2024;3(1):e0000417. doi: 10.1371/journal.pdig.0000417
18. Liu C, Wei M, Qin Y, et al. Harnessing Large Language Models for Structured Reporting in Breast Ultrasound: A Comparative Study of Open AI (GPT-4.0) and Microsoft Bing (GPT-4). Ultrasound Med Biol 2024;50(11):1697-1703. doi: 10.1016/j.ultrasmedbio.2024.07.007
19. Afshar M, Gao Y, Wills G, et al. Prompt engineering with a large language model to assist providers in responding to patient inquiries: a real-time implementation in the electronic health record. JAMIA Open 2024;7(3):ooae080. doi: 10.1093/jamiaopen/ooae080
20. Warren CJ, Edmonds VS, Payne NG, et al. Prompt matters: evaluation of large language model chatbot responses related to Peyronie's disease. Sex Med 2024;12(4):qfae055. doi: 10.1093/sexmed/qfae055
21. McHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22(3):276-282.
22. Demšar J. Statistical Comparisons of Classifiers over Multiple Data Sets. J Mach Learn Res 2006;7:1-30.
23. Zhao Z, Wang S, Gu J, et al. ChatCAD+: Toward a Universal and Reliable Interactive CAD Using LLMs. IEEE Trans Med Imaging 2024;43(11):3755-3766. doi: 10.1109/TMI.2024.3398350
24. Tian Y, Li Z, Jin Y, et al. Foundation model of ECG diagnosis: Diagnostics and explanations of any form and rhythm on ECG. Cell Rep Med 2024;5(12):101875. doi: 10.1016/j.xcrm.2024.101875
25. Wu SH, Tong WJ, Li MD, et al. Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models. Radiology 2024;310(3):e232255. doi: 10.1148/radiol.232255
26. Waqas A, Bui MM, Glassy EF, et al. Revolutionizing Digital Pathology With the Power of Generative Artificial Intelligence and Foundation Models. Lab Invest 2023;103(11):100255. doi: 10.1016/j.labinv.2023.100255
27. Tripathi S, Gabriel K, Tripathi PK, Kim E. Large language models reshaping molecular biology and drug development. Chem Biol Drug Des 2024;103(6):e14568. doi: 10.1111/cbdd.14568
28. Chakraborty C, Bhattacharya M, Lee SS. Artificial intelligence enabled ChatGPT and large language models in drug target discovery, drug discovery, and development. Mol Ther Nucleic Acids 2023;33:866-868. doi: 10.1016/j.omtn.2023.08.009
29. Le Guellec B, Lefevre A, Geay C, et al. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. Radiol Artif Intell 2024;6(4):e230364. doi: 10.1148/ryai.230364
30. Lee JE, Park KS, Kim YH, Song HC, Park B, Jeong YJ. Lung Cancer Staging Using Chest CT and FDG PET/CT Free-Text Reports: Comparison Among Three ChatGPT Large Language Models and Six Human Readers of Varying Experience. AJR Am J Roentgenol 2024;223(6):e2431696. doi: 10.2214/AJR.24.31696
31. Ong JCL, Seng BJJ, Law JZF, et al. Artificial intelligence, ChatGPT, and other large language models for social determinants of health: Current state and future directions. Cell Rep Med 2024;5(1):101356. doi: 10.1016/j.xcrm.2023.101356

Additional Declarations
No competing interests reported.

Supplementary Files
Supplementalmaterials.docx

Editorial timeline (Version 1, status: Under Review)
Reviewers invited by journal: 06 Oct, 2025
Editor assigned by journal: 01 Oct, 2025
Editor invited by journal: 10 Sep, 2025
Submission checks completed at journal: 09 Sep, 2025
First submitted to journal: 09 Sep, 2025
© Research Square 2026 | ISSN 2693-5015 (online)
{"props":{"pageProps":{"initialData":{"identity":"rs-7526460","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":530516281,"identity":"a61d19c3-a276-4b8b-877d-d626d72e93b8","order_by":0,"name":"Wenjie Liu","email":"","orcid":"","institution":"Second Affiliated Hospital of Nanchang University","correspondingAuthor":false,"prefix":"","firstName":"Wenjie","middleName":"","lastName":"Liu","suffix":""},{"id":530516282,"identity":"c9336b1d-e3cb-49dd-bc39-897ada0f2f7c","order_by":1,"name":"Hailong Wu","email":"","orcid":"","institution":"Second Affiliated Hospital of Nanchang University","correspondingAuthor":false,"prefix":"","firstName":"Hailong","middleName":"","lastName":"Wu","suffix":""},{"id":530516283,"identity":"7858060f-a9ec-4d47-987a-1f3b4251556d","order_by":2,"name":"Yuanyuan Lang","email":"","orcid":"","institution":"Second Affiliated Hospital of Nanchang University","correspondingAuthor":false,"prefix":"","firstName":"Yuanyuan","middleName":"","lastName":"Lang","suffix":""},{"id":530516284,"identity":"2f77972f-c101-47e6-9517-770bcdef134f","order_by":3,"name":"Yan Luo","email":"","orcid":"","institution":"Second Affiliated Hospital of Nanchang University","correspondingAuthor":false,"prefix":"","firstName":"Yan","middleName":"","lastName":"Luo","suffix":""},{"id":530516285,"identity":"08845be3-05a4-4671-b070-cb2fd4d382e0","order_by":4,"name":"Yan Li","email":"","orcid":"","institution":"Second Affiliated Hospital of Nanchang
University","correspondingAuthor":false,"prefix":"","firstName":"Yan","middleName":"","lastName":"Li","suffix":""},{"id":530516286,"identity":"a0484696-d2c0-4b6c-9280-07d589a1ee0c","order_by":5,"name":"Xinyi Liu","email":"","orcid":"","institution":"Second Affiliated Hospital of Nanchang University","correspondingAuthor":false,"prefix":"","firstName":"Xinyi","middleName":"","lastName":"Liu","suffix":""},{"id":530516287,"identity":"f606f4fc-a514-4ab3-a995-eb21efa1c7bd","order_by":6,"name":"Yinping Leng","email":"","orcid":"","institution":"Second Affiliated Hospital of Nanchang University","correspondingAuthor":false,"prefix":"","firstName":"Yinping","middleName":"","lastName":"Leng","suffix":""},{"id":530516288,"identity":"fbe7ade3-5258-4917-a241-fcc2845587d7","order_by":7,"name":"Lianggeng Gong","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA10lEQVRIiWNgGAWjYLCCBAMJOX725gMHPlQQq+VBgYWxZM+xxIMzzhCpg/HBh4rEDTdyjA/zthChXN699/ALoMNAWj4c4G1gkOcXO4Bfi+GZc2kWQC3GM8+83XBAcgeD4czZCQS0zMgxMwBqke07nrvhgOEZYFDcJlILY8OBnAcHEtuI0CIvkWP8AKhFccKJHIYDB4nRYsBzxgwUL6BANjjYcEaCsF/k23uMP/74UweKysef/1TYyPNLE7LlAAObBBJfAqdKhC0NDMwfCCsbBaNgFIyCEQ0AgjZPRqQFEjYAAAAASUVORK5CYII=","orcid":"","institution":"Second Affiliated Hospital of Nanchang University","correspondingAuthor":true,"prefix":"","firstName":"Lianggeng","middleName":"","lastName":"Gong","suffix":""}],"badges":[],"createdAt":"2025-09-03 11:23:36","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-7526460/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-7526460/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":93883258,"identity":"a0be62b9-c66b-4a19-8d12-d387ab6d91c3","added_by":"auto","created_at":"2025-10-19 
16:59:53","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":1872297,"visible":true,"origin":"","legend":"","description":"","filename":"Manuscript.docx","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/46485e7feadb5c97d0c310ba.docx"},{"id":93883259,"identity":"b93863bd-08cc-42da-a50d-c2846670b366","added_by":"auto","created_at":"2025-10-19 16:59:53","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":9140,"visible":true,"origin":"","legend":"","description":"","filename":"81dbfaf9ccfd42fea98bf6ad3cbc8cd3.json","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/9a0368f9987e08b8f1a2ede4.json"},{"id":93883257,"identity":"8a7f356a-a212-444c-ac63-1d50a9c1f274","added_by":"auto","created_at":"2025-10-19 16:59:53","extension":"docx","order_by":2,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":19027,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementalmaterials.docx","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/0934329b8e3c2f157e2700b1.docx"},{"id":93883263,"identity":"e2222ef9-e28d-4fcc-bc70-bd32cd3df7fb","added_by":"auto","created_at":"2025-10-19 16:59:53","extension":"xml","order_by":3,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":108180,"visible":true,"origin":"","legend":"","description":"","filename":"81dbfaf9ccfd42fea98bf6ad3cbc8cd31enriched.xml","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/1f6b1d36a0ec374e9485b797.xml"},{"id":93883261,"identity":"a90f3e6e-8e25-4fe0-be38-59af2ce441ed","added_by":"auto","created_at":"2025-10-19 
16:59:53","extension":"jpeg","order_by":4,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":235308,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage1.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/5ff92f8a0515b13ebe72efaf.jpeg"},{"id":93881943,"identity":"b9234065-1c5b-4176-b654-678e8ee36841","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"jpeg","order_by":5,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":125748,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage2.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/d3fcf0ff6a431896a3586126.jpeg"},{"id":93881947,"identity":"f91408fc-7b38-4029-a28d-7e64f021b3fa","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"jpeg","order_by":6,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":703620,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage3.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/011068aeda608231b11659f3.jpeg"},{"id":93883260,"identity":"8cd5c476-ec69-4091-b5d0-7d71d393609e","added_by":"auto","created_at":"2025-10-19 16:59:53","extension":"jpeg","order_by":7,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":338520,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage4.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/5845897895a7f2f918d23f07.jpeg"},{"id":93881957,"identity":"048f4e33-a85f-4aad-b4b4-8b8cc461cf28","added_by":"auto","created_at":"2025-10-19 
16:51:53","extension":"jpeg","order_by":8,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":364824,"visible":true,"origin":"","legend":"","description":"","filename":"floatimage5.jpeg","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/2c2d881b0e171f2858978a65.jpeg"},{"id":93881952,"identity":"7ab168f2-7c3b-4eed-a4b1-cafbab5dad25","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":9,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":39408,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/6df54e1e0f7c50ceca8ee09e.png"},{"id":93881953,"identity":"3d4fee17-9244-4d87-9a8f-df1e6a763e6f","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":10,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":16407,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage2.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/b4f66e68c9f69574ebc1af73.png"},{"id":93881955,"identity":"d30212fe-8247-4955-872a-850e7d042aad","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":11,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":75554,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage3.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/503138a5315b31eb8c06fa46.png"},{"id":93881948,"identity":"78644695-739c-47eb-916e-32e8b59093f1","added_by":"auto","created_at":"2025-10-19 
16:51:53","extension":"png","order_by":12,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":29557,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage4.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/22e8aa6be1857c9d923e75b7.png"},{"id":93881954,"identity":"27ccfbb0-49a7-4dd1-86b0-41fec8349ace","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":13,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":59697,"visible":true,"origin":"","legend":"","description":"","filename":"Onlinefloatimage5.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/6bd07383a29fb38171761efa.png"},{"id":93881956,"identity":"d542f692-d7c0-4b6d-a50e-e5b0a0df75bd","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"xml","order_by":14,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":104882,"visible":true,"origin":"","legend":"","description":"","filename":"81dbfaf9ccfd42fea98bf6ad3cbc8cd31structuring.xml","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/4656570684c1623b45a6bde7.xml"},{"id":93883262,"identity":"fa153060-f5d1-491d-8756-8d0e8a941c7b","added_by":"auto","created_at":"2025-10-19 16:59:53","extension":"html","order_by":15,"title":"","display":"","copyAsset":false,"role":"acdc-reference","size":115982,"visible":true,"origin":"","legend":"","description":"","filename":"earlyproof.html","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/49e4e5b7d0431a82eefd7e21.html"},{"id":93881942,"identity":"05623687-863e-4440-8110-7b7a8f231bdb","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":113125,"visible":true,"origin":"","legend":"\u003cp\u003eFlowchart depicting the patient selection process.\u003c/p\u003e\n\u003cp\u003eDBT= digital breast tomosynthesis, BI-RADS= Breast 
Imaging-Reporting and Data System, ACR= American College of Radiology, LLM= large language model.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/6e0d2c75cfe7b61ca9df687c.png"},{"id":93881935,"identity":"99bc5450-a03b-487f-b597-f9d9849f7a83","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":47016,"visible":true,"origin":"","legend":"\u003cp\u003eWeighted Cohen's kappa of large language models with different prompts. (A) GPT-4o. (B) GPT-o3 mini. (C) Qwen-2.5 max.\u003c/p\u003e\n\u003cp\u003e*\u003cem\u003eP\u003c/em\u003e\u0026lt; 0.05 calculated as described by the Friedman test for the comparison of LLM-LLM κw values.**\u003cem\u003eP\u003c/em\u003e\u0026lt; 0.01 calculated as described by the Friedman test for the comparison of LLM-LLM κw values.\u003c/p\u003e\n\u003cp\u003eLLM= large language model.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/18eadb0672125156c7d2e980.png"},{"id":93881940,"identity":"01d706d0-3ad4-4e98-a449-3a4b802f9e64","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":296733,"visible":true,"origin":"","legend":"\u003cp\u003eConfusion matrix of the large language models with different prompts. (A-C) Confusion matrix of the GPT-4o with different prompts. (D-F) Confusion matrix of the GPT-o3 with different prompts. 
(G-I) Confusion matrix of the Qwen-2.5 max with different prompts.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/6e94ad1612217c7b2ca5195a.png"},{"id":93881945,"identity":"0655e757-66ba-445a-80ba-f725f82fe2f1","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":67002,"visible":true,"origin":"","legend":"\u003cp\u003eWeighted Cohen’s kappa of human readers with different levels of experience.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/a3bf35bde9d34640b0ad638b.png"},{"id":93881950,"identity":"965d0ce9-e4dd-4f30-ab19-f234c65d65c8","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":162338,"visible":true,"origin":"","legend":"\u003cp\u003eSankey diagrams for GPT-4o and human readers with different levels of experience. (A-C) Sankey diagrams for GPT-4o with different prompts. (D) Sankey diagrams for resident doctors. (E) Sankey diagrams for entry-level doctors. 
(F) Sankey diagrams for mid-level doctors.\u003c/p\u003e\n\u003cp\u003eBI-RADS= Breast Imaging-Reporting and Data System.\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/5c055e35ee5b8324fd81fb67.png"},{"id":93883515,"identity":"04dd7a60-ed5b-4b61-96f1-a8ed506dde35","added_by":"auto","created_at":"2025-10-19 17:07:53","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1465855,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/96feb3ad-b721-4d57-80bc-4daf915dd34b.pdf"},{"id":93881937,"identity":"e9d9d26b-cfba-42db-b7e4-e50b26a5f2f5","added_by":"auto","created_at":"2025-10-19 16:51:53","extension":"docx","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":19027,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementalmaterials.docx","url":"https://assets-eu.researchsquare.com/files/rs-7526460/v1/897773e2305592be9d217842.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"Prompt Engineering in Large Language Models for BI-RADS Classification of Imaging Reports: A Retrospective Evaluation","fulltext":[{"header":"Summary Statement","content":"\u003cp\u003eOptimized prompting enabled best-performing GPT-4o to achieve near-perfect BI-RADS agreement with experts in DBT reports and low clinical management misclassification rate, indicating LLMs' potential as auxiliary tools for radiologists.\u003c/p\u003e"},{"header":"Key Results","content":"\u003cp\u003eIn Prompt III, GPT-4o demonstrated near-perfect agreement with experts (κw, 0.80), outperforming both GPT-o3 mini (0.76) and Qwen (0.79). 
Notably, the κw of GPT-4o in Prompt III was significantly higher than that observed in Prompt II (0.69, \u003cem\u003eP\u003c/em\u003e\u0026lt;0.05) and Prompt I (0.63, \u003cem\u003eP\u003c/em\u003e\u0026lt;0.01).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eThe optimization of prompts altered the proportion of BI-RADS category assignments that would result in negative changes in clinical management, aligning with the performance of entry-level radiologists (GPT-4o: Prompt I, 29.6%; Prompt II, 28.2 %; Prompt III, 14.8%. Entry-level radiologists, 15.3% and 14.4%).\u003c/p\u003e"},{"header":"Introduction","content":"\u003cp\u003eIn breast cancer screening and management, the Breast Imaging Reporting and Data System (BI-RADS) plays a critical role in guiding clinical decision-making, treatment selection, and prognostic assessment (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e, \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e). However, accurate BI-RADS classification requires extensive clinical expertise, often leading to interobserver variability among radiologists with varying levels of experience (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e). Furthermore, the BI-RADS classification criteria vary across different imaging modalities, including mammography, breast ultrasound, and breast magnetic resonance imaging, further complicating the grading process (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e). These challenges in standardization and consistency emphasize the need for a more efficient solution. 
In recent years, large language models (LLMs) have emerged, offering a promising new approach to address these limitations (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eLLMs are large-scale artificial intelligence systems trained through deep learning techniques, capable of understanding, processing, and generating natural language (\u003cspan citationid=\"CR7\" class=\"CitationRef\"\u003e7\u003c/span\u003e, \u003cspan citationid=\"CR8\" class=\"CitationRef\"\u003e8\u003c/span\u003e). In recent years, the rapid development of LLMs, particularly within the healthcare sector, has attracted significant interest, with their potential applications becoming increasingly recognized (\u003cspan additionalcitationids=\"CR10\" citationid=\"CR9\" class=\"CitationRef\"\u003e9\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR11\" class=\"CitationRef\"\u003e11\u003c/span\u003e). Models such as the GPT series, exemplify LLMs\u0026rsquo; emergence as powerful tools in critical healthcare processes, including medical diagnosis, treatment planning, and patient management, owing to their robust text generation and comprehension capabilities (\u003cspan additionalcitationids=\"CR13 CR14 CR15\" citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR16\" class=\"CitationRef\"\u003e16\u003c/span\u003e). 
Next-generation LLMs, such as GPT-4o, demonstrate enhanced precision and consistency, particularly in areas such as semantic understanding, reasoning, and specialized applications (\u003cspan citationid=\"CR17\" class=\"CitationRef\"\u003e17\u003c/span\u003e).\u003c/p\u003e\u003cp\u003eIn recent years, researchers have explored the potential of LLMs for BI-RADS classification from free-text radiology reports (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e, \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e6\u003c/span\u003e). Although pre-trained LLMs exhibit some accuracy gaps compared to expert radiologists, they have demonstrated significant potential in improving classification consistency and reducing inter-observer variability (\u003cspan citationid=\"CR18\" class=\"CitationRef\"\u003e18\u003c/span\u003e). Prompt engineering has emerged as a key technique for enhancing LLM performance (\u003cspan citationid=\"CR19\" class=\"CitationRef\"\u003e19\u003c/span\u003e, \u003cspan citationid=\"CR20\" class=\"CitationRef\"\u003e20\u003c/span\u003e); however, its effectiveness in BI-RADS classification has not been comprehensively explored. This study aimed to evaluate LLM performance in BI-RADS classification of free-text digital breast tomosynthesis (DBT) reports, with an emphasis on improving classification accuracy and consistency through prompt optimization. Various prompt engineering strategies were explored to enhance the model\u0026rsquo;s predictive capability for BI-RADS categorization and support AI-assisted radiologic diagnosis.\u003c/p\u003e"},{"header":"Materials and Methods","content":"\u003cdiv id=\"Sec3\" class=\"Section2\"\u003e\u003ch2\u003eStudy Design and Sample\u003c/h2\u003e\u003cp\u003e This retrospective study received approval from the Institutional Review Board of the Second Affiliated Hospital of Nanchang University, which also waived the requirement for written informed consent. 
Radiology reports from adult women who underwent DBT examinations at the Second Affiliated Hospital of Nanchang University from January 2024 to December 2024 were collected. All reports were authored by radiologists who underwent specialized training and were written exclusively in Chinese. Exclusion criteria included (\u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e1\u003c/span\u003e) incomplete reports, (\u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e) post-surgical examination, (\u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e3\u003c/span\u003e) patients aged under 18 years, (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e) BI-RADS 0 and 6 classifications (as these require additional information not included in the imaging descriptions), and (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e) BI-RADS 1 classifications (due to the small number of reports). To ensure an equal distribution across BI-RADS categories (\u003cspan additionalcitationids=\"CR3 CR4\" citationid=\"CR2\" class=\"CitationRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e), cases were created under strict adherence to these exclusion criteria. Relevant clinical information, such as patient age, was also collected. At no point during the study were the images from the examinations evaluated. A flowchart detailing the patient selection process is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eReport Processing and Evaluation\u003c/h3\u003e\n\u003cp\u003eThe clinical information, findings, impression, and BI-RADS categories sections from all exported imaging reports were extracted into a spreadsheet for subsequent analysis. 
Two certified senior breast radiologists (with 28 and 30 years of post-training experience, respectively) were then invited to independently review these reports according to the 6th edition of the American College of Radiology BI-RADS classification system. Each radiologist assigned a BI-RADS category, which served as the reference. During the evaluation process, the two experts were only provided with the \"Findings\" section of the reports.\u003c/p\u003e\n\u003ch3\u003eBI-RADS Category Assignment by LLMs\u003c/h3\u003e\n\u003cp\u003eThis study evaluated the performance of three LLMs in BI-RADS classification: GPT-4o (OpenAI, USA), GPT-o3 mini (OpenAI, USA), and Qwen-2.5 max (Alibaba, China), which are newer iterations of widely recognized conversational artificial intelligence applications.\u003c/p\u003e\u003cp\u003eEach model was tested with three different prompts, with both the imaging findings and prompts written in Chinese (\u003cb\u003eTable \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e). Prompt I included elements of role-playing, grading limitations, and specified task requirements. Prompt II, an extension of Prompt I, suggested assigning a higher grade when the BI-RADS classification was unclear. Prompt III further expanded on Prompt II by recommending special attention to seemingly benign lesions and carefully considering whether the grade should be elevated, consistent with established clinical practice. The findings from the reports and prompts were input into GPT-4o, GPT-o3 mini, and Qwen-2.5 max. After each report was input and a response received, the chat session was restarted. These models were tested using a zero-shot learning approach (i.e., no example data was provided prior to testing). To assess reproducibility, 10 reports were randomly selected from each BI-RADS category and re-entered into the LLMs for evaluation after 7 days. 
For this study, these models were accessed between February 3, 2025, and February 20, 2025, during which the versions of the three models remained unchanged.\u003c/p\u003e\n\u003ch3\u003eBI-RADS Category Assignment by Human Readers\u003c/h3\u003e\n\u003cp\u003eThis study invited six human readers, categorized by their experience levels: two in-training radiologists (both had 6 months of post-training experience and BI-RADS classification training), two entry-level breast radiologists (with 6 and 7 years of post-training experience, respectively), and two mid-level breast radiologists (with 12 and 10 years of post-training experience, respectively), hereafter referred to as resident 1, resident 2, entry-level radiologist 1, entry-level radiologist 2, mid-level radiologist 1, and mid-level radiologist 2. Each reader independently evaluated and assigned a BI-RADS classification to each report.\u003c/p\u003e\u003cdiv id=\"Sec7\" class=\"Section2\"\u003e\u003ch2\u003eStatistical Analysis\u003c/h2\u003e\u003cp\u003eAll statistical analyses were performed using Python (version 3.9.13), with the following packages used: Statsmodels (0.13.2), Seaborn (0.11.2), Matplotlib (3.5.2), and Scikit-learn (1.6.1).\u003c/p\u003e\u003cp\u003eThe Weighted Cohen's kappa (κw) statistic was used to assess the consistency across multiple categories. The κw values and corresponding 95% confidence intervals (CIs) were calculated for the comparison between all LLMs and the reference standard, as well as between the six human readers and the reference standard. 
The strength of inter-rater agreement was interpreted as follows: κw\u0026thinsp;\u0026lt;\u0026thinsp;0.00 indicates poor agreement, 0.00-0.20 indicates slight agreement, 0.21\u0026ndash;0.40 indicates fair agreement, 0.41\u0026ndash;0.60 indicates moderate agreement, 0.61\u0026ndash;0.80 indicates substantial agreement, and 0.81-1.00 indicates almost perfect agreement (\u003cspan citationid=\"CR21\" class=\"CitationRef\"\u003e21\u003c/span\u003e). The significance of the kappa value was assessed using a two-tailed z-test based on asymptotic normal distribution theory. To evaluate significant differences between prompt conditions, the Friedman test (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e) was employed to assess overall differences in κw values across three prompt conditions, followed by a Nemenyi post-hoc test (\u003cspan citationid=\"CR22\" class=\"CitationRef\"\u003e22\u003c/span\u003e) to determine inter-group significance. A \u003cem\u003eP-\u003c/em\u003evalue\u0026thinsp;\u0026lt;\u0026thinsp;0.05 was considered statistically significant.\u003c/p\u003e\u003cp\u003eBesides, the assigned BI-RADS categories were grouped according to their associated clinical management pathways: BI-RADS 2 (normal or benign; no intervention required), BI-RADS 3 (probably benign; short-term follow-up), and BI-RADS 4 or 5 (suspicious or highly suggestive of malignancy; biopsy recommended). To determine whether reclassification (downgrade or upgrade) by the six human readers or LLMs would have a negative impact on clinical management, the reference standard was the established scenario for clinical management.\u003c/p\u003e\u003c/div\u003e"},{"header":"Results","content":"\u003cdiv id=\"Sec9\" class=\"Section2\"\u003e\u003ch2\u003eStudy Sample\u003c/h2\u003e\u003cp\u003eA total of 216 imaging reports were included in the study, with 54 reports allocated to each BI-RADS category. 
Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e summarizes the characteristics of the 216 female patients and their corresponding radiology reports. The mean age of the patients was 50\u0026thinsp;\u0026plusmn;\u0026thinsp;10 years. All 216 DBT imaging reports were entirely written in Chinese, and the extracted report sections contained an average of 103.24 words.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eBaseline characteristics of patients.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"2\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eCharacteristics\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eValue\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eSex (n\u0026thinsp;=\u0026thinsp;216)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFemale\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e100%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eAge (years)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e50.48\u0026thinsp;\u0026plusmn;\u0026thinsp;10.07\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eNo. of words in extracted portions of report\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e103.24\u0026thinsp;\u0026plusmn;\u0026thinsp;31.44\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eReport language (n\u0026thinsp;=\u0026thinsp;216)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eChinese only\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e100%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eExamination type (n\u0026thinsp;=\u0026thinsp;216)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDBT\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e100%\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eReference-standard BI-RADS classification based on radiology report\u003c/p\u003e\u003cp\u003e(sixth edition)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u0026nbsp;\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBI-RADS 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e54 (25%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBI-RADS 3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e54 (25%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBI-RADS 4\u003c/p\u003e\u003c/td\u003e\u003ctd 
align=\"left\" colname=\"c2\"\u003e\u003cp\u003e54 (25%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eBI-RADS 5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e54 (25%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colspan=\"2\" nameend=\"c2\" namest=\"c1\"\u003e\u003cp\u003eData are presented as mean\u0026thinsp;\u0026plusmn;\u0026thinsp;standard deviation for normally distributed continuous variables, or number (%) for categorical variables.\u003c/p\u003e\u003cp\u003eDBT\u0026thinsp;=\u0026thinsp;digital breast tomosynthesis. BI-RADS\u0026thinsp;=\u0026thinsp;Breast Imaging Reporting and Data System.\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003c/div\u003e\n\u003ch3\u003eEvaluation of BI-RADS Classification of LLMs Across Different Prompts\u003c/h3\u003e\n\u003cp\u003eFigure \u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e illustrates the performance of the three LLMs in BI-RADS classification across different prompts. With Prompt III, GPT-4o reached the upper bound of substantial agreement (κw, 0.80), numerically higher than GPT-o3 mini (0.76) and Qwen-2.5 max (0.79). Notably, the κw of GPT-4o with Prompt III was significantly higher than that observed with Prompt II (0.69, \u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05) and Prompt I (0.63, \u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.01).\u003c/p\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e displays the confusion matrices of the LLMs in BI-RADS classification across different prompts. With Prompt II (which added an ambiguous-case escalation mechanism to Prompt I), all LLMs significantly reduced the misclassification of BI-RADS 5 as 4. 
Specifically, GPT-4o, Qwen-2.5 max, and GPT-o3 mini showed decreases in such misclassified cases of 54% (28 to 13), 50% (12 to 6), and 68% (37 to 12), respectively. However, the misclassification of BI-RADS 3 as 2 and of BI-RADS 4 as 3 or 2 did not improve significantly.\u003c/p\u003e\u003cp\u003eWith Prompt III (which added a low-malignancy-probability lesion attention mechanism to Prompt II), all LLMs not only reduced misclassification of BI-RADS 5 as 4 but also reduced misclassification of BI-RADS 3 as 2. Specifically, GPT-4o, Qwen-2.5 max, and GPT-o3 mini showed reductions in BI-RADS 3-to-2 misclassifications of 61% (from 43 to 17), 65% (from 43 to 15), and 63% (from 46 to 17), respectively. In addition, the misclassification of BI-RADS 4 as 2 was nearly eliminated (GPT-4o: from 11 to 0; Qwen-2.5 max: from 10 to 0; GPT-o3 mini: from 9 to 1).\u003c/p\u003e\u003cp\u003eFor each prompt, there were no significant differences in κw between the initial BI-RADS classifications and the BI-RADS classifications performed 7 days later for the three LLMs. The κw values were almost perfect with Prompt III (GPT-4o, 0.84; GPT-o3 mini, 0.85; Qwen-2.5 max, 0.90) (\u003cb\u003eTable S2\u003c/b\u003e). Prompt II showed substantial to almost perfect κw values (0.74\u0026ndash;0.84), and Prompt I likewise showed substantial to almost perfect agreement (0.68\u0026ndash;0.83).\u003c/p\u003e\u003cdiv id=\"Sec11\" class=\"Section2\"\u003e\u003ch2\u003eEvaluation of BI-RADS Classification of LLMs-Human and Human-Human\u003c/h2\u003e\u003cp\u003eFigure \u003cspan refid=\"Fig4\" class=\"InternalRef\"\u003e4\u003c/span\u003e illustrates the performance of the LLMs and the six physicians with varying levels of experience in BI-RADS classification. The κw values of the two mid-level radiologists (0.90 and 0.86) were significantly higher than those of the LLMs under Prompt III. 
With Prompt III, GPT-4o achieved a κw of 0.80, slightly higher than that of entry-level radiologist 1 (0.76), comparable to that of entry-level radiologist 2 (0.79), and notably higher than those of the two residents (0.54 and 0.61).\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec12\" class=\"Section2\"\u003e\u003ch2\u003eLLMs-Human and Human-Human Agreement according to Clinical Management\u003c/h2\u003e\u003cp\u003eTables\u0026nbsp;\u003cspan refid=\"Tab2\" class=\"InternalRef\"\u003e2\u003c/span\u003e\u0026ndash;\u003cspan refid=\"Tab3\" class=\"InternalRef\"\u003e3\u003c/span\u003e and \u003cb\u003eTable S3\u003c/b\u003e summarize the performance of the LLMs across different prompts with respect to clinical management categories. Within each prompt, the proportions of changed clinical management were comparable across the LLMs (with GPT-4o demonstrating the best overall performance). However, the proportion of changed clinical management with Prompt III was significantly lower than that with Prompts I and II. Table\u0026nbsp;\u003cspan refid=\"Tab4\" class=\"InternalRef\"\u003e4\u003c/span\u003e summarizes the performance of the six human readers in clinical management categories; the best performers at each experience level were resident 2, entry-level radiologist 2, and mid-level radiologist 1. The reference standard comprised cases requiring no treatment (BI-RADS 2, n\u0026thinsp;=\u0026thinsp;54), follow-up in 6 months (BI-RADS 3, n\u0026thinsp;=\u0026thinsp;54), and biopsy or aspiration (BI-RADS 4 or 5, n\u0026thinsp;=\u0026thinsp;108). 
Figure\u0026nbsp;\u003cspan refid=\"Fig6\" class=\"InternalRef\"\u003e5\u003c/span\u003e illustrates the potential changes in clinical management between LLMs and human interpretations.\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAgreement between LLMs in Prompt I and the reference standard on clinical management reports.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOutcome\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eStandard-GPT-4o\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eStandard-GPT-o3\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eStandard-Qwen\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eWeighted Cohen\u0026rsquo;s kappa (95% CIs)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.63 (0.57, 0.69)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.60 (0.54, 0.66)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.65 (0.59, 0.71)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eChanges in management\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e64/216 (29.6%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e65/216 (30.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e69/216 (31.9%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eUpgraded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e3/216 (1.4%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2/216 (0.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e5/216 (2.3%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 2 to BI-RADS 3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e2/3 (66.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1/2 (50.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e2/5 (40.0%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 2/3 to BI-RADS 4/5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1/3 (33.3%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1/2 (50.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e3/5 (60.0%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDowngraded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e61/216 (28.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e63/216 
(29.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e64/216 (29.6%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 3 to BI-RADS 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e43/61 (70.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e46/63 (73.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e43/64 (67.2%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 4/5 to BI-RADS 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e11/61 (18.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e9/63 (14.3%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e10/64 (15.6%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 4/5 to BI-RADS 3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e7/61 (11.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e8/63 (12.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e11/64 (17.2%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003ctfoot\u003e\u003ctr\u003e\u003ctd colspan=\"4\"\u003eData are expressed as a numerator and a denominator with the percentage in parentheses.\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"4\"\u003eLLM\u0026thinsp;=\u0026thinsp;large language model. GPT-o3\u0026thinsp;=\u0026thinsp;GPT-o3 mini. Qwen\u0026thinsp;=\u0026thinsp;Qwen-2.5 max. 
CIs\u0026thinsp;=\u0026thinsp;confidence intervals.\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"4\"\u003eBI-RADS\u0026thinsp;=\u0026thinsp;Breast Imaging Reporting and Data System.\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAgreement between LLMs in Prompt III and the reference standard on clinical management reports.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"4\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOutcome\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eStandard-GPT-4o\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eStandard-GPT-o3\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eStandard-Qwen\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eWeighted Cohen\u0026rsquo;s kappa (95% CIs)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.80 (0.75, 0.85)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.76 (0.71, 0.82)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.79 (0.74, 
0.84)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eChanges in management\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e32/216 (14.8%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e41/216 (19.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e35/216 (16.2%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eUpgraded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e2/216 (0.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2/216 (0.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e7/216 (3.2%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 2 to BI-RADS 3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e2/2 (100%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e2/2 (100%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e6/7 (85.7%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 2/3 to BI-RADS 4/5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0/2 (0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0/2 (0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e1/7 (14.3%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eDowngraded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e30/216 (13.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" 
colname=\"c3\"\u003e\u003cp\u003e39/216 (18.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e28/216 (13.0%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 3 to BI-RADS 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e17/30 (56.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e17/39 (43.6%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e15/28 (53.6%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 4/5 to BI-RADS 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0/30 (0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e1/39 (2.6%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0/28 (0%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 4/5 to BI-RADS 3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e13/30 (43.3%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e21/39 (53.8%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e13/28 (46.4%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003ctfoot\u003e\u003ctr\u003e\u003ctd colspan=\"4\"\u003eData are expressed as a numerator and a denominator with the percentage in parentheses.\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"4\"\u003eLLM\u0026thinsp;=\u0026thinsp;large language model. GPT-o3\u0026thinsp;=\u0026thinsp;GPT-o3 mini. Qwen\u0026thinsp;=\u0026thinsp;Qwen-2.5 max. 
CIs\u0026thinsp;=\u0026thinsp;confidence intervals.\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"4\"\u003eBI-RADS\u0026thinsp;=\u0026thinsp;Breast Imaging Reporting and Data System.\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003e\u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e\u003ccaption language=\"En\"\u003e\u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e\u003cdiv class=\"CaptionContent\"\u003e\u003cp\u003eAgreement between human readers and the reference standard on clinical management reports.\u003c/p\u003e\u003c/div\u003e\u003c/caption\u003e\u003ccolgroup cols=\"7\"\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e\u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e\u003cthead\u003e\u003ctr\u003e\u003cth align=\"left\" colname=\"c1\"\u003e\u003cp\u003eOutcome\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c2\"\u003e\u003cp\u003eStandard-Resident 1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c3\"\u003e\u003cp\u003eStandard-Resident 2\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c4\"\u003e\u003cp\u003eStandard-Entry-level 1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c5\"\u003e\u003cp\u003eStandard-Entry-level 2\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c6\"\u003e\u003cp\u003eStandard-Mid-level 
1\u003c/p\u003e\u003c/th\u003e\u003cth align=\"left\" colname=\"c7\"\u003e\u003cp\u003eStandard-Mid-level 2\u003c/p\u003e\u003c/th\u003e\u003c/tr\u003e\u003c/thead\u003e\u003ctbody\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eWeighted Cohen\u0026rsquo;s kappa (95% CIs)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e0.54 (0.47, 0.61)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0.61 (0.54, 0.67)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0.76 (0.71, 0.81)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0.79 (0.74, 0.84)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0.90 (0.86, 0.93)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0.86 (0.82, 0.90)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eChanges in management\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e76/216 (35.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e55/216 (25.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e33/216 (15.3%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e31/216 (14.4%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e11/216 (5.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e23/216 (10.6%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eUpgraded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e9/216 (4.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e9/216 
(4.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e9/216 (4.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e8/216 (3.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e3/216 (1.4%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e6/216 (2.8%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 2 to BI-RADS 3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e8/9 (88.9%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e5/9 (55.6%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e7/9 (77.8%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e6/8 (75.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e3/3 (100%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e5/6 (83.3%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 2/3 to BI-RADS 4/5\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1/9 (11.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e4/9 (44.4%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e2/9 (22.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e2/8 (25.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0/3 (0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e1/6 (16.7%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" 
colname=\"c1\"\u003e\u003cp\u003eDowngraded\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e67/216 (31.0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e46/216 (21.3%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e24/216 (11.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e23/216 (10.6%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e8/216 (3.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e17/216 (7.9%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 3 to BI-RADS 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e15/67 (22.4%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e13/46 (28.3%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e10/24 (41.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e11/23 (47.8%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e7/8 (87.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e11/17 (64.7%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 4/5 to BI-RADS 2\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e1/67 (1.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e0/46 (0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e0/24 (0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e0/23 (0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e0/8 
(0%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e0/17 (0%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd align=\"left\" colname=\"c1\"\u003e\u003cp\u003eFrom BI-RADS 4/5 to BI-RADS 3\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c2\"\u003e\u003cp\u003e51/67 (76.1%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c3\"\u003e\u003cp\u003e33/46 (71.7%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c4\"\u003e\u003cp\u003e14/24 (58.3%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c5\"\u003e\u003cp\u003e12/23 (52.2%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c6\"\u003e\u003cp\u003e1/8 (12.5%)\u003c/p\u003e\u003c/td\u003e\u003ctd align=\"left\" colname=\"c7\"\u003e\u003cp\u003e6/17 (35.3%)\u003c/p\u003e\u003c/td\u003e\u003c/tr\u003e\u003c/tbody\u003e\u003c/colgroup\u003e\u003ctfoot\u003e\u003ctr\u003e\u003ctd colspan=\"7\"\u003eData are expressed as a numerator and a denominator with the percentage in parentheses.\u003c/td\u003e\u003c/tr\u003e\u003ctr\u003e\u003ctd colspan=\"7\"\u003eEntry-level\u0026thinsp;=\u0026thinsp;Entry-level doctor; Mid-level\u0026thinsp;=\u0026thinsp;Mid-level doctor. CIs\u0026thinsp;=\u0026thinsp;confidence intervals. BI-RADS\u0026thinsp;=\u0026thinsp;Breast Imaging Reporting and Data System\u003c/td\u003e\u003c/tr\u003e\u003c/tfoot\u003e\u003c/table\u003e\u003c/div\u003e\u003c/p\u003e\u003cp\u003eWhen the BI-RADS classifications from both the LLM and the reference standard lead to the same clinical decision (e.g., both BI-RADS 4/5 recommend biopsy), no change in management is required. 
For underclassification, in which BI-RADS categories were inappropriately downgraded (e.g., from BI-RADS 4 or 5 to follow-up categories, or from BI-RADS 3, 4, or 5 to no intervention), the rate for GPT-4o with Prompt III was 13.9% (n\u0026thinsp;=\u0026thinsp;30/216), which was significantly lower than with Prompt I (28.2% [n\u0026thinsp;=\u0026thinsp;61/216]) and Prompt II (27.8% [n\u0026thinsp;=\u0026thinsp;60/216]). Entry-level radiologist 2 and mid-level radiologist 1 showed significantly lower underclassification rates than GPT-4o with any prompt (10.6% [n\u0026thinsp;=\u0026thinsp;23/216] and 3.7% [n\u0026thinsp;=\u0026thinsp;8/216], respectively), while resident 2 had a higher underclassification rate (21.3% [n\u0026thinsp;=\u0026thinsp;46/216]) than GPT-4o with Prompt III. Underclassification by GPT-4o under Prompt I and Prompt II mainly involved downgrading from follow-up to no intervention (70.5% [n\u0026thinsp;=\u0026thinsp;43/61] and 80.0% [n\u0026thinsp;=\u0026thinsp;48/60], respectively), whereas resident 1 primarily downgraded from biopsy-required categories to follow-up (76.1% [n\u0026thinsp;=\u0026thinsp;51/67]). Conversely, for overclassification, in which categories were inappropriately upgraded (e.g., from BI-RADS 2 to follow-up, or from BI-RADS 2 or 3 to biopsy-required categories), the rate for GPT-4o with Prompt III was 0.9%, which was comparable to that with Prompt I (1.4%) and Prompt II (0.5%). Notably, none of these overclassification rates exceeded those observed for resident 2, entry-level radiologist 2, and mid-level radiologist 1 (4.2%, 3.7%, and 1.4%, respectively).\u003c/p\u003e\u003c/div\u003e"},{"header":"Discussion","content":"\u003cp\u003eTo our knowledge, this study represents the first systematic evaluation of LLM performance in BI-RADS classification of free-text DBT imaging reports through prompt engineering. 
Comparative analysis with expert radiologists revealed that strategic prompting significantly improved the agreement of BI-RADS classification evaluation in LLMs. These findings highlight the potential of prompt engineering to enhance LLM performance in supporting BI-RADS classification, indicating that strategic prompt design may be a crucial tool for improving consistency and accuracy in breast imaging interpretation.\u003c/p\u003e\u003cp\u003eRecent advancements in LLMs have demonstrated significant progress in the medical field. For instance, in medical imaging, LLMs have demonstrated capabilities in assisting clinicians with the interpretation of radiological, pathological, and electrocardiographic images, thereby enhancing both efficiency and accuracy in diagnostic workflows (\u003cspan additionalcitationids=\"CR24 CR25\" citationid=\"CR23\" class=\"CitationRef\"\u003e23\u003c/span\u003e\u0026ndash;\u003cspan citationid=\"CR26\" class=\"CitationRef\"\u003e26\u003c/span\u003e). Furthermore, LLMs exhibit substantial potential in drug discovery, facilitating the identification of novel therapeutic targets and accelerating the development of new pharmaceutical agents (\u003cspan citationid=\"CR27\" class=\"CitationRef\"\u003e27\u003c/span\u003e, \u003cspan citationid=\"CR28\" class=\"CitationRef\"\u003e28\u003c/span\u003e). Beyond data extraction from free-text reports (\u003cspan citationid=\"CR29\" class=\"CitationRef\"\u003e29\u003c/span\u003e), LLMs can also perform tumor staging and grading through unstructured text analysis, such as leveraging ChatGPT to stage lung cancer based on free-text imaging reports (\u003cspan citationid=\"CR30\" class=\"CitationRef\"\u003e30\u003c/span\u003e). 
Despite these promising advancements, particularly in managing complex medical tasks, critical challenges, including instability, opacity, and inaccuracy in LLM outputs, remain substantial barriers to their clinical implementation (\u003cspan citationid=\"CR31\" class=\"CitationRef\"\u003e31\u003c/span\u003e). Besides, current research underscores the necessity of benchmarking LLM performance against human expertise across specialized domains as more advanced models are introduced.\u003c/p\u003e\u003cp\u003e Compared to prior studies, the GPT-4o model in this study demonstrated near-perfect agreement with senior radiologists (GPT-4o, κw\u0026thinsp;=\u0026thinsp;0.80), significantly surpassing the moderate agreement reported in earlier research (GPT-4, Gwet AC1\u0026thinsp;=\u0026thinsp;0.52). Furthermore, the proportion of clinical management changes induced by GPT-4o (14.8%) was notably lower than that observed in previous investigations using GPT-4 (18.1%) (\u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e5\u003c/span\u003e). The performance of GPT-4o exceeded that of residents and yielded comparable results to those of entry-level radiologists with limited experience. The advancements in this study may be attributed to three primary innovations. First, the dual-principle prompt III design incorporated rule-based constraints aligned with American College of Radiology BI-RADS guidelines, reducing BI-RADS 5 to 4 misclassification rates by 53.6\u0026ndash;67.6% across models through conservative upgrade principles (\u003cspan citationid=\"CR4\" class=\"CitationRef\"\u003e4\u003c/span\u003e). Besides, a low-heterogeneity attention mechanism mandated cross-validation of subtle imaging features, decreasing BI-RADS 3 to 2 misclassifications by 60.5\u0026ndash;65.1% and eliminating 4 to 2 errors in GPT-4o and Qwen-2.5 max. 
This mechanism specifically addresses confirmation bias in the interpretation of low-risk lesions, a known cognitive pitfall in diagnostic decision-making. Second, GPT-4o\u0026rsquo;s architectural and training data optimizations enhanced its capacity to process complex linguistic tasks inherent to radiology reporting, outperforming earlier models. Third, restricting BI-RADS categorization outputs to categories 2 to 5 allowed the model to focus on ambiguous cases, improving classification accuracy.\u003c/p\u003e\u003cp\u003eNotably, while this study utilized the κw to assess agreement, prior research relied on Gwet\u0026rsquo;s AC1 statistic. Although both metrics evaluate classification consistency, κw imposes penalties for ordinal misclassifications (e.g., adjacent category errors), enabling more comprehensive performance evaluation. This distinction likely contributes to the higher agreement observed in our study.\u003c/p\u003e\u003cp\u003eNotably, the proportion of clinical category management changes under GPT-4o with Prompt III was 14.8%, which was comparable to that of residents, yet significantly higher than that of mid-level radiologists. This suggests that, as an assistive tool, LLMs can facilitate rapid onboarding for novice practitioners and hold potential for reducing workload among entry-level radiologists, while also highlighting the necessity for human oversight in high-risk decision-making scenarios.\u003c/p\u003e\u003cp\u003eWith the global proliferation and immediate availability of LLMs, particularly in the context of the current shortage of healthcare professionals, patients may upload imaging reports (only findings) to LLMs for interpretation. The process employed in this study, involving zero-shot training and no exposure of LLMs to images, closely simulates this behavior. 
Such scenarios could raise significant ethical and legal concerns, particularly if patients rely solely on LLM-generated results, which could potentially lead to misdiagnosis or treatment delays. Therefore, to prevent such outcomes, educating patients on the proper use of LLMs for interpreting imaging reports and critically evaluating the generated results is becoming increasingly important.\u003c/p\u003e\u003cp\u003eThis study has several limitations that should be acknowledged. First, the sample was derived from a single institution and was conducted exclusively in Chinese, which limits the generalizability of the findings to some extent. Future studies should incorporate diverse languages and institutions to validate the applicability and performance of LLMs in varied contexts. Second, while prompt engineering improved LLM performance in classification consistency, this study did not compare LLMs with other artificial intelligence models, such as convolutional neural networks. Furthermore, this study primarily focused on the analysis of classification accuracy while neglecting factors critical in actual clinical settings, such as real-time performance and interpretability. Future research should place greater emphasis on optimizing LLM real-time feedback capabilities and addressing their transparency issues in clinical applications to enhance their operability and acceptance among clinicians.\u003c/p\u003e\u003cp\u003eIn conclusion, the current study results indicate that although optimizing prompts can enhance the performance of LLMs in BI-RADS classification to some extent, their accuracy remains inferior to that of experienced clinicians. Given the critical role of BI-RADS classification in breast cancer diagnosis and management, LLMs optimized through prompt engineering are not yet capable of fully replacing the expertise of trained healthcare professionals. 
Therefore, future research should focus on developing a multimodal framework driven by LLMs (e.g., integrating image feature vectors as input to the LLM and modeling them through joint visual-text encoding) to facilitate deeper collaboration between artificial intelligence and clinical expertise, ultimately enhancing healthcare quality and improving patient outcomes.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthics approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe studies involving human participants were reviewed and approved by the Institutional Review Board of the Second Affiliated Hospital of Nanchang University. Since data were evaluated retrospectively, pseudonymously and were solely obtained for treatment purposes, a requirement of informed consent was waived by the Institutional Review Board. All authors have confirmed that any experiments involving humans and/or the use of human tissue samples were conducted in accordance with relevant guidelines and regulations.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsent for publication\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAvailability of data and materials\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used and analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNo funding.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor Contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConception and design: W.L. and Y.L. 
Administrative support: L.G. Collection and assembly of data: W.L. Data analysis and interpretation: W.L. Partial human readers: Y.L., Y.L., H.W., Y.L., W.L., X.L. \u0026nbsp;Manuscript writing: W.L., Y.L., and L.G. Final approval of manuscript: L.G. All authors contributed to the article and approved the submitted version.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eTimmers JM, van Doorne-Nagtegaal HJ, Zonderland HM, et al. The Breast Imaging Reporting and Data System (BI-RADS) in the Dutch breast cancer screening programme: its role as an assessment and stratification tool. Eur Radiol 2012;22(8):1717-1723. doi: 10.1007/s00330-012-2409-2\u003c/li\u003e\n \u003cli\u003eSpak DA, Plaxco JS, Santiago L, Dryden MJ, Dogan BE. BI-RADS® fifth edition: A summary of changes. Diagn Interv Imaging 2017;98(3):179-190. doi: 10.1016/j.diii.2017.01.001\u003c/li\u003e\n \u003cli\u003eEkpo EU, Ujong UP, Mello-Thoms C, McEntee MF. Assessment of Interradiologist Agreement Regarding Mammographic Breast Density Classification Using the Fifth Edition of the BI-RADS Atlas. AJR Am J Roentgenol 2016;206(5):1119-1123. doi: 10.2214/AJR.15.15049\u003c/li\u003e\n \u003cli\u003eAmerican College of Radiology. ACR BI-RADS Atlas: Breast Imaging Reporting and Data System. 5th ed. Reston, VA: American College of Radiology, 2013.\u003c/li\u003e\n \u003cli\u003eCozzi A, Pinker K, Hidber A, et al. BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology 2024;311(1):e232133. doi: 10.1148/radiol.232133\u003c/li\u003e\n \u003cli\u003eHaver HL, Yi PH, Jeudy J, Bahl M. Use of ChatGPT to Assign BI-RADS Assessment Categories to Breast Imaging Reports. AJR Am J Roentgenol 2024;223(3):e2431093. doi: 10.2214/AJR.24.31093\u003c/li\u003e\n \u003cli\u003eYang J, Jin H, Tang R, et al. 
Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond. ACM Transactions on Knowledge Discovery from Data 2023;18:1-32. doi: 10.1145/3649506\u003c/li\u003e\n \u003cli\u003eFan L, Li L, Zihui, Lee S, Yu H, Hemphill L. A Bibliometric Review of Large Language Models Research from 2017 to 2023. ACM Transactions on Intelligent Systems and Technology 2023. doi: 10.48550/arXiv.2304.02020\u003c/li\u003e\n \u003cli\u003eWill ChatGPT transform healthcare? Nature Medicine 2023;29(3):505-506. doi: 10.1038/s41591-023-02289-5\u003c/li\u003e\n \u003cli\u003eMukherjee P, Hou B, Lanfredi RB, Summers RM. Feasibility of Using the Privacy-preserving Large Language Model Vicuna for Labeling Radiology Reports. Radiology 2023;309(1):e231147. doi: 10.1148/radiol.231147\u003c/li\u003e\n \u003cli\u003eAdams LC, Truhn D, Busch F, et al. Leveraging GPT-4 for Post Hoc Transformation of Free-text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study. Radiology 2023;307(4):e230725. doi: 10.1148/radiol.230725\u003c/li\u003e\n \u003cli\u003eLiu J, Wang C, Liu S. Utility of ChatGPT in Clinical Practice. J Med Internet Res 2023;25:e48568. doi: 10.2196/48568\u003c/li\u003e\n \u003cli\u003eYeo YH, Samaan JS, Ng WH, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023;29(3):721-732. doi: 10.3350/cmh.2023.0089\u003c/li\u003e\n \u003cli\u003eRao A, Pang M, Kim J, et al. Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study. J Med Internet Res 2023;25:e48659. doi: 10.2196/48659\u003c/li\u003e\n \u003cli\u003eOzenbas C, Engin D, Altinok T, Akcay E, Aktas U, Tabanli A. ChatGPT-4o\u0026apos;s Performance in Brain Tumor Diagnosis and MRI Findings: A Comparative Analysis with Radiologists. Acad Radiol 2025. doi: 10.1016/j.acra.2025.01.033\u003c/li\u003e\n \u003cli\u003eFatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. 
ChatGPT in medicine: A cross-disciplinary systematic review of ChatGPT\u0026apos;s (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore) 2024;103(32):e39250. doi: 10.1097/MD.0000000000039250\u003c/li\u003e\n \u003cli\u003eGallifant J, Fiske A, Levites Strekalova YA, et al. Peer review of GPT-4 technical report and systems card. PLOS Digit Health 2024;3(1):e0000417. doi: 10.1371/journal.pdig.0000417\u003c/li\u003e\n \u003cli\u003eLiu C, Wei M, Qin Y, et al. Harnessing Large Language Models for Structured Reporting in Breast Ultrasound: A Comparative Study of Open AI (GPT-4.0) and Microsoft Bing (GPT-4). Ultrasound Med Biol 2024;50(11):1697-1703. doi: 10.1016/j.ultrasmedbio.2024.07.007\u003c/li\u003e\n \u003cli\u003eAfshar M, Gao Y, Wills G, et al. Prompt engineering with a large language model to assist providers in responding to patient inquiries: a real-time implementation in the electronic health record. JAMIA Open 2024;7(3):ooae080. doi: 10.1093/jamiaopen/ooae080\u003c/li\u003e\n \u003cli\u003eWarren CJ, Edmonds VS, Payne NG, et al. Prompt matters: evaluation of large language model chatbot responses related to Peyronie\u0026apos;s disease. Sex Med 2024;12(4):qfae055. doi: 10.1093/sexmed/qfae055\u003c/li\u003e\n \u003cli\u003eMcHugh ML. Interrater reliability: the kappa statistic. Biochem Med (Zagreb) 2012;22(3):276-282.\u003c/li\u003e\n \u003cli\u003eDem\u0026scaron;ar J. Statistical Comparisons of Classifiers over Multiple Data Sets. J Mach Learn Res 2006;7:1\u0026ndash;30.\u003c/li\u003e\n \u003cli\u003eZhao Z, Wang S, Gu J, et al. ChatCAD+: Toward a Universal and Reliable Interactive CAD Using LLMs. IEEE Trans Med Imaging 2024;43(11):3755-3766. doi: 10.1109/TMI.2024.3398350\u003c/li\u003e\n \u003cli\u003eTian Y, Li Z, Jin Y, et al. Foundation model of ECG diagnosis: Diagnostics and explanations of any form and rhythm on ECG. Cell Rep Med 2024;5(12):101875. 
doi: 10.1016/j.xcrm.2024.101875\u003c/li\u003e\n \u003cli\u003eWu SH, Tong WJ, Li MD, et al. Collaborative Enhancement of Consistency and Accuracy in US Diagnosis of Thyroid Nodules Using Large Language Models. Radiology 2024;310(3):e232255. doi: 10.1148/radiol.232255\u003c/li\u003e\n \u003cli\u003eWaqas A, Bui MM, Glassy EF, et al. Revolutionizing Digital Pathology With the Power of Generative Artificial Intelligence and Foundation Models. Lab Invest 2023;103(11):100255. doi: 10.1016/j.labinv.2023.100255\u003c/li\u003e\n \u003cli\u003eTripathi S, Gabriel K, Tripathi PK, Kim E. Large language models reshaping molecular biology and drug development. Chem Biol Drug Des 2024;103(6):e14568. doi: 10.1111/cbdd.14568\u003c/li\u003e\n \u003cli\u003eChakraborty C, Bhattacharya M, Lee SS. Artificial intelligence enabled ChatGPT and large language models in drug target discovery, drug discovery, and development. Mol Ther Nucleic Acids 2023;33:866-868. doi: 10.1016/j.omtn.2023.08.009\u003c/li\u003e\n \u003cli\u003eLe Guellec B, Lefevre A, Geay C, et al. Performance of an Open-Source Large Language Model in Extracting Information from Free-Text Radiology Reports. Radiol Artif Intell 2024;6(4):e230364. doi: 10.1148/ryai.230364\u003c/li\u003e\n \u003cli\u003eLee JE, Park KS, Kim YH, Song HC, Park B, Jeong YJ. Lung Cancer Staging Using Chest CT and FDG PET/CT Free-Text Reports: Comparison Among Three ChatGPT Large Language Models and Six Human Readers of Varying Experience. AJR Am J Roentgenol 2024;223(6):e2431696. doi: 10.2214/AJR.24.31696\u003c/li\u003e\n \u003cli\u003eOng JCL, Seng BJJ, Law JZF, et al. Artificial intelligence, ChatGPT, and other large language models for social determinants of health: Current state and future directions. Cell Rep Med 2024;5(1):101356. 
doi: 10.1016/j.xcrm.2023.101356\u003cstrong\u003e\u003cbr\u003e\u003c/strong\u003e\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"bmc-cancer","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bcan","sideBox":"Learn more about [BMC Cancer](http://bmccancer.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bcan/default.aspx","title":"BMC Cancer","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"","lastPublishedDoi":"10.21203/rs.3.rs-7526460/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-7526460/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003ePurpose\u003c/h2\u003e\u003cp\u003eTo evaluate how prompt engineering modulates large language models' (LLMs) accuracy in Breast Imaging Reporting and Data System (BI-RADS) classification of digital breast tomosynthesis (DBT) reports.\u003c/p\u003e\u003ch2\u003eMaterials and Methods\u003c/h2\u003e\u003cp\u003eThis retrospective study collected reports from 216 patients who underwent DBT for breast cancer screening or diagnosis. BI-RADS classifications were independently assigned to all reports by two experts. Three LLMs (GPT-4o, GPT-o3 mini, Qwen-2.5 max) were utilized to classify all reports using different prompts. 
In addition, six human readers independently assigned BI-RADS classifications. Agreement between experts and LLMs for BI-RADS categories was evaluated using weighted Cohen\u0026rsquo;s kappa (κw). Friedman and Nemenyi tests assessed κw differences among the three prompt conditions. The frequencies of changed BI-RADS category assignments, which could impact clinical management, were also calculated.\u003c/p\u003e\u003ch2\u003eResults\u003c/h2\u003e\u003cp\u003eIn prompt III, GPT-4o achieved near-perfect agreement with experts (κw\u0026thinsp;=\u0026thinsp;0.80), surpassing GPT-o3 mini (0.76) and Qwen-2.5 max (0.79). Its κw was significantly higher in prompt III than in prompt II (0.69, \u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.05) and prompt I (0.63, \u003cem\u003eP\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.01). While GPT-4o's κw remained lower than that of two mid-level radiologists (0.89 and 0.86), it exceeded that of two entry-level radiologists (0.76 and 0.79). Regarding clinical management changes, prompt III yielded a 14.8% discordance rate with experts, outperforming prompts I (29.6%) and II (28.2%), and aligning with entry-level radiologists (15.3%, 14.4%).\u003c/p\u003e\u003ch2\u003eConclusion\u003c/h2\u003e\u003cp\u003eWith optimized prompts, GPT-4o achieved near-perfect agreement and matched the clinical management performance of entry-level radiologists. 
These findings support the use of LLMs as an auxiliary tool for BI-RADS classification in breast cancer diagnosis by radiologists.\u003c/p\u003e","manuscriptTitle":"Prompt Engineering in Large Language Models for BI-RADS Classification of Imaging Reports: A Retrospective Evaluation","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-10-19 16:51:48","doi":"10.21203/rs.3.rs-7526460/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"reviewersInvited","content":"","date":"2025-10-06T08:55:33+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-10-01T08:43:57+00:00","index":"","fulltext":""},{"type":"editorInvited","content":"","date":"2025-09-10T05:54:43+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-09-09T15:37:03+00:00","index":"","fulltext":""},{"type":"submitted","content":"BMC Cancer","date":"2025-09-09T15:33:58+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"bmc-cancer","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"bcan","sideBox":"Learn more about [BMC Cancer](http://bmccancer.biomedcentral.com/)","snPcode":"","submissionUrl":"https://www.editorialmanager.com/bcan/default.aspx","title":"BMC Cancer","twitterHandle":"BMC_series","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"em","reportingPortfolio":"BMC Series","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"a22672fe-78a9-4142-961d-b80bfc25fa1d","owner":[],"postedDate":"October 19th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[],"tags":[],"updatedAt":"2025-10-19T16:51:48+00:00","versionOfRecord":[],"versionCreatedAt":"2025-10-19 
16:51:48","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-7526460","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-7526460","identity":"rs-7526460","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
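The agreement statistic used throughout the paper, weighted Cohen's kappa (κw), penalizes a disagreement more heavily the further apart the two ordinal BI-RADS categories are. The following is a minimal sketch of that computation in plain Python, not the authors' analysis code. The function name `weighted_kappa`, the choice between linear and quadratic weights, and the toy ratings are all assumptions made for illustration; the paper does not state which weighting scheme was used.

```python
from collections import Counter


def weighted_kappa(rater_a, rater_b, categories, weight="linear"):
    """Weighted Cohen's kappa for two raters over ordered categories.

    `categories` lists the ordinal labels from lowest to highest,
    e.g. [2, 3, 4, 5] for BI-RADS 2 to 5 (hypothetical encoding).
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")
    if weight not in ("linear", "quadratic"):
        raise ValueError("weight must be 'linear' or 'quadratic'")

    index = {c: i for i, c in enumerate(categories)}
    n = len(rater_a)
    power = 1 if weight == "linear" else 2

    def disagreement(i, j):
        # 0 for identical categories, 1 for the two extreme categories.
        return (abs(i - j) / (k - 1)) ** power

    # Joint counts of (rater A category, rater B category) and the marginals.
    joint = Counter((index[a], index[b]) for a, b in zip(rater_a, rater_b))
    marg_a = Counter(index[a] for a in rater_a)
    marg_b = Counter(index[b] for b in rater_b)

    # Observed and chance-expected weighted disagreement.
    observed = sum(disagreement(i, j) * c for (i, j), c in joint.items()) / n
    expected = sum(
        disagreement(i, j) * marg_a[i] * marg_b[j]
        for i in range(k)
        for j in range(k)
    ) / (n * n)
    if expected == 0:
        raise ValueError("kappa is undefined when both raters use one category")
    return 1.0 - observed / expected


if __name__ == "__main__":
    cats = [2, 3, 4, 5]
    reference = [2, 3, 4, 5, 2, 3, 4, 5]   # toy expert labels
    adjacent = [3, 3, 4, 5, 2, 3, 4, 5]    # one BI-RADS 2 read as 3
    distant = [5, 3, 4, 5, 2, 3, 4, 5]     # one BI-RADS 2 read as 5
    print(weighted_kappa(reference, adjacent, cats))
    print(weighted_kappa(reference, distant, cats))
```

In the toy example, a single adjacent-category error lowers κw less than a single error spanning three categories, which is the ordinal penalty the Discussion refers to when contrasting κw with Gwet's AC1.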
