Retrieval-Augmented-Generation large language models outperform junior clinicians in guideline-concordant PSA testing

doi:10.21203/rs.3.rs-5760861/v1

2025 · doi:10.21203/rs.3.rs-5760861/v1

preprint OA: closed CC-BY-4.0

Joshua Yi Min Tung, Quan Le, Jinxuan Yao, Yifei Huang, Daniel Yan Zheng Lim, Gerald Gui Ren Sng, Rachel Shu En Lau, Yu Guang Tan, Kenneth Chen, Kae Jack Tay, Jen Hong Tan, John Shyi-Peng Yuen, Christopher Wai Sam Cheng, Henry Sun Sien Ho

Singapore General Hospital

This is a preprint; it has not been peer reviewed by a journal. This work is licensed under a CC BY 4.0 License. Status: Posted (Version 1).

Abstract

Background and Objective

Society guidelines for prostate cancer screening via PSA testing serve to standardize patient care, and are often utilized by trainees, junior staff, or generalist medical practitioners to guide medical decision-making. Adherence to guidelines is a time-consuming and challenging task, and rates of inappropriate PSA testing are high.
This study evaluates a retrieval-augmented generation (RAG) enhanced large language model (LLM), grounded in current EAU and AUA guidelines, to assess its effectiveness in providing guideline-concordant PSA screening recommendations compared to junior clinicians.

Methods

A retrieval-augmented generation (RAG) pipeline was developed and used to process a series of 44 fictional case scenarios. Five junior clinicians were tasked to provide PSA testing recommendations for the same scenarios, in closed-book and open-book formats. Answers were graded in a binary fashion (correct/incorrect) and compared for accuracy.

Key Findings and Limitations

The RAG-LLM tool provided guideline-concordant recommendations in 95.5% of case scenarios, compared to junior clinicians, who were correct in 62.3% of scenarios in a closed-book format and 74.1% of scenarios in an open-book format. The difference was statistically significant for both closed-book (p < 0.001) and open-book (p < 0.001) formats.

Conclusions and Clinical Implications

Use of RAG techniques allows LLMs to integrate complex guidelines into day-to-day medical decision-making. RAG-LLM tools in Urology have the capability to enhance clinical decision-making by providing guideline-concordant recommendations for PSA testing, potentially improving the consistency of healthcare delivery, reducing cognitive load on clinicians, and reducing unnecessary investigations and costs.

Keywords: Health sciences/Urology; Health sciences/Medical research; Artificial intelligence; large language models; retrieval augmented generation

Figure 1. Schematic diagram for PSA Recommendation Pipeline

1. Introduction

Prostate cancer is the second most commonly diagnosed cancer and the fifth leading cause of cancer-related death among men globally [1]. Screening for prostate cancer is thus a common issue in both primary and specialist care settings.
Prostate-specific antigen (PSA) testing is the most widely used method for early detection, but remains controversial in the urological literature, largely owing to the harms associated with over-diagnosis and over-treatment [2,3].

Society guidelines for prostate cancer screening via PSA testing serve to streamline and standardize patient care, and are often utilized by trainees, junior staff, or non-specialist medical practitioners to guide medical decision-making. Such guidelines have been issued by various organizations such as the European Association of Urology (EAU) [4] and American Urological Association (AUA) [5], but discrepancies between these guidelines (such as recommendations on whether PSA screening should be offered, the appropriate patient populations, and screening intervals) pose challenges for clinical decision-making. These are further complicated by the need to consider other patient factors, such as estimated life expectancy (as many guidelines do not recommend PSA screening in patients with a life expectancy of < 10 or < 15 years) and the patient's own preferences. Shared decision making (SDM) forms a key component in both the EAU and AUA guidelines, particularly in older men, or those with multiple medical comorbidities.

The current EAU-EANM-ESTRO-ESUR-ISUP-SIOG and AUA/SUO guidelines on prostate cancer and early detection of prostate cancer stand at 239 and 47 pages respectively. Appropriate decision-making and adherence to guidelines is therefore a time-consuming and challenging task for non-specialists in a primary care setting, and even in the specialist-outpatient setting where time constraints are common. Prior studies have shown a low rate of compliance with organizational guidelines, such as a cohort study of 32,306 males showing that 40% of men in the > 80-year age group received inappropriate PSA screening [6].
One potential solution to this problem is to use AI to parse guidelines and deliver an appropriate recommendation. Large language models (LLMs) are a form of artificial intelligence (AI) trained on large amounts of text data, and hence have the capability to process unstructured text inputs and generate appropriate responses. They can thus be applied in healthcare, such as in patient communications, education, and clinical risk stratification [7]. However, general LLMs such as the GPT models developed by OpenAI are not specifically designed for healthcare use, and can produce inaccurate or misleading information. They have a knowledge cut-off based on the recency of the underlying training data, for example January 2022 for the current generation of OpenAI GPT models.

To address these limitations, retrieval-augmented generation (RAG) techniques have been developed to enhance the accuracy of LLMs. RAG directs the LLM to answer a given scenario by reference to an additional database of curated information, such as a set of guidelines. By 'grounding' the responses using relevant information from the database, LLMs can overcome their intrinsic knowledge cut-off, and produce responses with less hallucination. The aim of this study was thus to evaluate the accuracy of a RAG-enabled LLM grounded in the EAU and AUA guidelines pertaining to prostate cancer screening.

2. Methods

This study was conducted in a simulated environment using only fictional patient data. As the use of fictional data does not fall under local Human Biomedical Research Act regulations, ethical approval was not required.

2.1. Development of case scenarios

A series of 44 fictional case scenarios was developed to reflect a range of clinical presentations in an outpatient clinic setting.
These free-text scenarios included fictional patient biodata such as age, medical comorbidities, presence or absence of urological symptoms (such as hematuria or lower urinary tract symptoms), and prior PSA readings (if applicable). These were written by a urology fellow with 8 years of clinical experience, supervised by 2 urology consultants with > 20 years of clinical experience each.

2.2. Development of RAG-enabled LLM model

We developed an automated pipeline to process case scenarios based on how a healthcare provider would provide a PSA testing recommendation. The schematic diagram is shown in Fig. 1. Key components of this pipeline include:

(1) An LLM-based calculator to extract relevant patient information (age and comorbidities) from the case scenario, in order to calculate the Charlson Comorbidity Index (CCI) and thereby estimate 10-year life expectancy. Patients who are not expected to live at least 10 years are not recommended for PSA screening [4,5], and the pipeline did not allow such case scenarios to proceed. Likewise, scenarios where the patient was > 72 years of age were also not permitted to proceed. We provide further technical details of the CCI calculator in Appendix 1.

(2) For patients with at least 10-year life expectancy based on CCI scores, a RAG-enabled LLM was used to provide a recommendation based on the given case scenario. In comparison with standard "off-the-shelf" LLMs which are not trained on domain-specific medical information, RAG allows the LLM to reference a fixed set of material, such as the relevant EAU and AUA society guidelines in this study. Language models augmented in this way with contextualized information can overcome their intrinsic knowledge deficits, and reduce hallucination by constraining the response to the given information.
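The eligibility check in component (1) can be illustrated with a minimal Python sketch. It only encodes the two thresholds stated above (age > 72 years, life expectancy below 10 years). The `PatientSummary` class, the `screening_gate` function, and the idea of passing in a pre-computed life expectancy are assumptions made for illustration; the authors' actual CCI calculator is described in their Appendix 1 and is not reproduced here.

```python
from dataclasses import dataclass

# Thresholds taken from the Methods text above.
MAX_AGE_YEARS = 72               # scenarios with age > 72 did not proceed
MIN_LIFE_EXPECTANCY_YEARS = 10   # fewer than 10 expected years did not proceed


@dataclass
class PatientSummary:
    """Hypothetical container for fields extracted from a case scenario."""
    age: int
    # Assumed to come from an upstream CCI-based estimate (not implemented here).
    estimated_life_expectancy_years: float


def screening_gate(patient: PatientSummary) -> tuple[bool, str]:
    """Return (proceed, reason) for the check that precedes the RAG step."""
    if patient.age > MAX_AGE_YEARS:
        return False, f"age {patient.age} exceeds {MAX_AGE_YEARS}"
    if patient.estimated_life_expectancy_years < MIN_LIFE_EXPECTANCY_YEARS:
        return False, "estimated life expectancy below 10 years"
    return True, "eligible for the RAG recommendation step"


if __name__ == "__main__":
    examples = (
        PatientSummary(60, 21.0),
        PatientSummary(78, 12.0),
        PatientSummary(66, 7.5),
    )
    for patient in examples:
        print(patient, screening_gate(patient))
```

A hard gate of this kind is simple to audit, but it cannot weigh individual risk factors, which is consistent with the rule-based limitation the authors report in the Discussion.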
Since the AUA and EAU guidelines occasionally provide different and non-overlapping recommendations, separate answers were generated from each set of guidelines first, before combining them for the final recommendations. We provide further technical details of the RAG-enabled LLM in Appendix 2. These include explanations of modern RAG techniques applied to optimize performance, such as context filtering to improve retrieval of relevant information, and advanced prompting methods (Chain-of-Thought reasoning [8], constraining answers to be based on retrieved information, providing example output structure, and use of an expert clinician persona). The full RAG prompt can be found in Appendix 3.

2.3. Relevant Software

The RAG prototype was developed with Python 3.10. Vector databases were constructed using the Unstructured API for ingestion of PDF documents, the OpenAI API for generation of text embeddings, and Qdrant as the vector database. For LLM calls, we used both OpenAI and Anthropic APIs for different components in our pipeline. We used both LlamaIndex and Langchain for orchestration, with LlamaIndex handling the retrieval-augmented generation components, whereas Langchain was used for structured data extraction and connecting pipeline components.

2.4. Answer Generation & Grading

Five junior clinicians were tasked to provide recommendations on PSA testing for each of the case scenarios. They comprised a mix of medical officers, urology junior residents, and family medicine residents. Each clinician completed the task in a "closed book" format, followed by an "open book" format in which they were permitted to reference relevant material of their choice (guidelines, textbooks). The time taken to complete the task in each format was recorded.

The RAG-LLM tool was likewise provided with the same set of fictional case scenarios and instructed to provide recommendations on PSA testing. We conducted 5 runs to assess the consistency of the LLM output.
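The per-guideline step described in Section 2.2 (retrieve from each guideline separately, answer, then combine) can be sketched in plain Python. This is a toy stand-in under stated assumptions, not the authors' pipeline: word-count cosine similarity replaces the embedding model and Qdrant vector search, the passages are placeholders rather than real guideline text, the prompt wording is invented, and `llm` is any caller-supplied function from prompt to answer.

```python
import math
import re
from collections import Counter
from typing import Callable

# PLACEHOLDER passages only: these are NOT real EAU or AUA guideline text.
GUIDELINE_PASSAGES = {
    "EAU": [
        "placeholder EAU passage on repeat screening intervals after a normal PSA result",
        "placeholder EAU passage on biopsy technique",
    ],
    "AUA": [
        "placeholder AUA passage on repeat screening intervals after a normal PSA result",
        "placeholder AUA passage on imaging before biopsy",
    ],
}


def tokenize(text: str) -> Counter:
    """Lowercase bag-of-words counts (toy substitute for an embedding)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages most similar to the query."""
    q = tokenize(query)
    ranked = sorted(passages, key=lambda p: cosine(q, tokenize(p)), reverse=True)
    return ranked[:k]


def build_prompt(scenario: str, guideline: str, context: list[str]) -> str:
    """Hypothetical prompt that constrains the answer to retrieved excerpts."""
    excerpts = "\n".join(f"- {passage}" for passage in context)
    return (
        f"Answer using only the {guideline} excerpts below.\n"
        f"Excerpts:\n{excerpts}\n\n"
        f"Case scenario:\n{scenario}\n"
        "Recommendation:"
    )


def recommend(scenario: str, llm: Callable[[str], str]) -> str:
    """Generate one answer per guideline, then combine them into one report."""
    answers = {}
    for guideline, passages in GUIDELINE_PASSAGES.items():
        context = retrieve(scenario, passages)
        answers[guideline] = llm(build_prompt(scenario, guideline, context))
    return "\n".join(f"{name}: {answer}" for name, answer in answers.items())


if __name__ == "__main__":
    # A stub stands in for the real LLM call.
    print(recommend("65-year-old man, normal PSA result last year",
                    llm=lambda prompt: "stub answer"))
```

Keeping the two guidelines in separate retrieval passes means a passage from one guideline cannot crowd out the other, so differing recommendations both reach the final report.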
Answers were graded by the study team in a binary format (correct/incorrect). Answers were marked as correct if they were concordant with either the EAU or AUA guidelines.

2.5. Statistical Analysis

SPSS version 26.0 (IBM Corp) was used for the statistical analysis of quantitative data. Answers from the RAG-LLM tool and human comparators were compared using Student's t-test.

3. Results

The RAG-LLM tool provided guideline-concordant recommendations in 95.5% of case scenarios, compared to junior clinicians, who were correct in 62.3% of scenarios in a closed-book format and 74.1% of scenarios in an open-book format. The difference was statistically significant for both closed-book (p < 0.001) and open-book (p < 0.001) formats.

Cases were divided into screening (20 cases) and follow-up (24 cases) categories. The RAG-LLM tool provided an incorrect recommendation in 1 screening case, where in all 5 instances it failed to offer a PSA test in a patient in whom screening was recommended. In comparison, junior clinicians missed 16 tests in the closed-book format and 11 in the open-book format. They also offered 14 unnecessary PSA tests in the closed-book format and 10 in the open-book format.

For follow-up cases, the RAG-LLM tool provided an incorrect recommendation in 1 case, where in all 5 instances it mistakenly offered a repeat PSA test in a patient with a normal PSA reading. In comparison, junior clinicians ordered 29 unnecessary tests in the closed-book format and 23 in the open-book format, and missed 24 tests and 13 tests in the closed- and open-book formats respectively.

Overall, the RAG-LLM tool recommended 71 fewer unnecessary PSA tests than junior clinicians, and missed 59 fewer PSA tests which should have been offered.
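The headline percentages can be checked from the error counts given in the Results text: 44 scenarios answered 5 times (5 LLM runs, or 5 clinicians) gives 220 graded answers per arm. The sketch below recomputes the accuracies and, as an illustration only, a pooled two-sample t statistic. Treating each graded answer as an independent 0/1 observation is an assumption; the text does not state how the authors structured their t-test in SPSS, and the function names here are hypothetical.

```python
import math

N_CASES = 44
N_REPEATS = 5                      # 5 LLM runs, or 5 junior clinicians
TOTAL = N_CASES * N_REPEATS        # 220 graded answers per arm

# Error counts taken from the Results text above.
errors = {
    "RAG-LLM": 5 + 5,                                # 1 screening + 1 follow-up case, wrong in all 5 runs
    "Clinicians, closed-book": 16 + 14 + 29 + 24,    # screening missed/unnecessary + follow-up unnecessary/missed
    "Clinicians, open-book": 11 + 10 + 23 + 13,
}


def accuracy_pct(n_errors: int, total: int = TOTAL) -> float:
    """Percentage of guideline-concordant answers, rounded to 1 decimal."""
    return round(100 * (total - n_errors) / total, 1)


def pooled_t(correct_a: int, correct_b: int, n: int = TOTAL) -> float:
    """Pooled-variance two-sample t statistic for two arms of n binary outcomes.

    Assumption: each graded answer is an independent 0/1 observation.
    """
    pa, pb = correct_a / n, correct_b / n
    var_a = n * pa * (1 - pa) / (n - 1)   # sample variance of 0/1 data
    var_b = n * pb * (1 - pb) / (n - 1)
    pooled = ((n - 1) * var_a + (n - 1) * var_b) / (2 * n - 2)
    return (pa - pb) / math.sqrt(pooled * (2 / n))


if __name__ == "__main__":
    for arm, n_err in errors.items():
        print(f"{arm}: {n_err} errors, {accuracy_pct(n_err)}% correct")
    llm_correct = TOTAL - errors["RAG-LLM"]
    for arm in ("Clinicians, closed-book", "Clinicians, open-book"):
        t = pooled_t(llm_correct, TOTAL - errors[arm])
        print(f"RAG-LLM vs {arm}: t = {t:.2f}")
```

The recomputed accuracies are 95.5%, 62.3% and 74.1%, matching the reported figures.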
Results were further analyzed by the following categories of cases: (1) PSA screening recommended, (2) PSA screening not recommended, (3) Follow-up of a normal PSA reading, (4) Management/follow-up of an elevated PSA reading, and (5) Others, including likely spuriously elevated PSA readings from concurrent urinary tract infections, elevated PSA readings in patients with significant co-morbidity in whom further/repeat testing would be unlikely to be beneficial, and normal PSA readings in patients with an abnormal digital rectal examination. Results are detailed in Table 1.

Average time taken by clinicians to provide a recommendation was 23 seconds in the closed-book format and 28 seconds in the open-book format. In comparison, the RAG-LLM tool averaged 9.7 seconds per recommendation.

Table 1. Results and breakdown of error categories

| Group | Unnecessary: short interval | Unnecessary: did not require | Unnecessary: subtotal | Missed: long interval | Missed: failed to offer | Missed: subtotal | Total errors | p |
|---|---|---|---|---|---|---|---|---|
| All cases | | | | | | | | |
| LLM | 5 | 0 | 5 | 0 | 5 | 5 | 10 (4.5%) | - |
| Human, closed-book | 11 | 32 | 43 | 26 | 14 | 40 | 83 (37.7%) | < 0.001 |
| Human, open-book | 10 | 23 | 33 | 14 | 10 | 24 | 57 (25.9%) | < 0.001 |
| Category 1: PSA screening recommended (11 cases) | | | | | | | | |
| LLM | 0 | 0 | 0 | 0 | 5 | 5 | 5 (9.1%) | - |
| Human, closed-book | 0 | 0 | 0 | 3 | 10 | 13 | 13 (23.6%) | 0.040 |
| Human, open-book | 0 | 0 | 0 | 1 | 10 | 11 | 11 (20.0%) | 0.107 |
| Category 2: PSA screening not recommended (9 cases) | | | | | | | | |
| LLM | 0 | 0 | 0 | 0 | 0 | 0 | 0 (0%) | - |
| Human, closed-book | 0 | 14 | 14 | 3 | 0 | 3 | 17 (37.8%) | < 0.001 |
| Human, open-book | 0 | 10 | 10 | 0 | 0 | 0 | 10 (22.2%) | 0.001 |
| Category 3: Normal PSA follow-up (9 cases) | | | | | | | | |
| LLM | 5 | 0 | 5 | 0 | 0 | 0 | 5 (11.1%) | - |
| Human, closed-book | 8 | 8 | 16 | 0 | 3 | 3 | 19 (42.2%) | 0.001 |
| Human, open-book | 7 | 7 | 14 | 1 | 0 | 1 | 15 (33.3%) | 0.011 |
| Category 4: Elevated PSA (8 cases) | | | | | | | | |
| LLM | 0 | 0 | 0 | 0 | 0 | 0 | 0 (0%) | - |
| Human, closed-book | 0 | 0 | 0 | 20 | 1 | 21 | 21 (52.5%) | < 0.001 |
| Human, open-book | 1 | 0 | 1 | 12 | 0 | 12 | 13 (28.9%) | < 0.001 |
| Category 5: Others (7 cases) | | | | | | | | |
| LLM | 0 | 0 | 0 | 0 | 0 | 0 | 0 (0%) | - |
| Human, closed-book | 3 | 10 | 13 | 0 | 0 | 0 | 13 (37.1%) | < 0.001 |
| Human, open-book | 2 | 6 | 8 | 0 | 0 | 0 | 8 (22.9%) | 0.002 |

4. Discussion

To our knowledge, this is the first study in the field of Urology demonstrating the efficacy of a retrieval-augmented generation LLM tool for clinical decision support. Augmenting large language models with contextualized information has been shown in other healthcare domains to reduce instances of hallucination and increase accuracy [9,10]. In this study, guideline-concordant recommendations were made in > 95% of scenarios by the RAG-LLM, as compared to 60–75% concordance by junior clinicians.

Examining responses which were not guideline-concordant, we found that the errors made by the RAG-LLM arose from (1) the rule-based nature of the CCI calculator, which precluded a 72-year-old from PSA screening despite strong risk factors for prostate cancer, and (2) erroneous interpretation of a normal PSA result as "moderately elevated", triggering a reactive repeat PSA test which in actuality was unnecessary. In contrast, the junior clinicians made errors across a broad range of categories, irrespective of seniority or training status.

In case scenarios where EAU and AUA guidelines provided differing recommendations for PSA testing intervals, the RAG-LLM tool provided both recommendations. In comparison, junior clinicians generally selected a single guideline document as a reference. While not incorrect, their responses were thus qualitatively less comprehensive and thorough than those generated by the LLM tool.

Our study demonstrates that RAG-LLM tools have the potential to augment clinical decision-making by providing guideline-concordant recommendations in real time. While such a clinical task may be relatively simple for an experienced specialist, generalists or junior clinicians may not necessarily have a similar familiarity and experience with specialist care.
Such clinical decision support tools may prove useful in primary care settings, or in care settings where it is practically challenging for a senior clinician to supervise every clinical decision when time pressure and patient volume supervene. Patient-specific, guideline-based tools can potentially relieve cognitive burden, smooth out learning curves, and improve decision-making time, thus improving overall consistency and efficiency of clinic consultations [11].

Use of RAG-LLM tools as a method to improve guideline adherence can also be a strategy to minimize unnecessary investigations and specialist consultation, thereby reducing costs to patients and public healthcare systems. In the primary care setting, increased adherence to guidelines has been shown to improve the quality and appropriateness of specialist referrals [12].

From a technical standpoint, RAG-LLM tools are preferable to "off-the-shelf" LLMs. The use of LLMs in clinical medicine engenders concerns of hallucination and resulting inaccurate recommendations, with implications for patient care and safety. Incorporating RAG systems in LLM tools reduces the frequency of hallucinations [13] and is more economical than fine-tuning or pre-training a model from the ground up.

We acknowledge some important limitations to this study, which fall into the clinical and technical domains. From a clinical perspective, this study uses fictional case scenarios, rather than real clinical cases. However, it is arguably better to perform LLM evaluation on a well-curated set of varied case scenarios, rather than a sample from a general population which would be less likely to feature uncommon or complex cases. This is analogous to assessment of junior clinicians, where ability would be assessed using a purposefully designed set of cases, rather than a general sample of common cases. A further clinical limitation is the use of the Charlson Comorbidity Index as a tool to estimate 10-year life expectancy.
Although the CCI is recommended in the EAU guidelines as a means of estimating life expectancy, it was created in 1987 and in the modern day has certain limitations, such as an incomplete list of comorbidities, assumptions that the effect of comorbidities is additive, and potentially lengthier disease prognoses with modern medical management [14].

From a technical perspective, although supplementing LLMs with RAG has been shown to reduce rates of AI hallucinations [13], these models are not altogether immune to hallucination. In addition, our RAG-LLM tool provided incorrect recommendations in a small minority of scenarios, but the reasons for it doing so are unclear, highlighting the 'black box' nature of many AI or AI-assisted tools [15]. Use of techniques such as prompt engineering and self-reflective RAG models may help to enhance the accuracy of these models [16]. Variability in performance across different LLMs also needs to be taken into account, and balanced against the cost of each model.

Despite these limitations, RAG-LLM tools retain potential for multiple applications in healthcare. The same guideline-based clinical decision support system can also be used retrospectively as an auditing tool, to identify areas of guideline-discordance in clinical practice. Furthermore, the RAG approach allows future guideline documents to be incorporated much more easily than a fine-tuning or pre-training approach, keeping the tool up-to-date and preventing obsolescence [17]. Expanding such RAG-LLM pipelines beyond PSA testing to other areas in Urology and the eventual development of a comprehensive clinical decision support system are potential areas for further research.

5. Conclusion

RAG-LLM tools can be used to create accurate and reliable real-time clinical decision support systems, outperforming junior clinicians in making efficient and guideline-concordant decisions.
The use of these tools can help increase guideline adherence, improving patient care and optimizing the utilization of healthcare resources. Although the use of these tools comes with challenges such as access to real-world datasets for machine learning, or susceptibility to AI hallucination, the use of RAG-LLM tools holds great application potential in improving patient-specific, evidence-based medicine.

Declarations

Conflicts of Interest: The authors have no conflicts of interest to declare.

Funding: This study was supported by an academic medicine philanthropic fund (the Foo Keong Tatt Professorship in Urology) from the Singapore Health Services Duke-National University of Singapore ("SingHealth Duke-NUS") Joint Office of Academic Medicine.

Acknowledgements: This study was supported by an academic medicine philanthropic fund (the Foo Keong Tatt Professorship in Urology) from the Singapore Health Services Duke-National University of Singapore ("SingHealth Duke-NUS") Joint Office of Academic Medicine.

Data Availability Statement: The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.

References

1. Bray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J Clin. 2024;74(3):229-263.
2. Etzioni R, Penson DF, Legler JM, di Tommaso D, Boer R, Gann PH, et al. Overdiagnosis due to prostate-specific antigen screening: lessons from U.S. prostate cancer incidence trends. J Natl Cancer Inst. 2002;94(13):981-990.
3. Pinsky PF, Parnes HL, Andriole G. Mortality and complications after prostate biopsy in the Prostate, Lung, Colorectal and Ovarian Cancer Screening (PLCO) trial. BJU Int. 2014;113(2):254-259.
4. Cornford P, van den Bergh RCN, Briers E, Van den Broeck T, Brunckhorst O, Darraugh J, et al. EAU-EANM-ESTRO-ESUR-ISUP-SIOG Guidelines on Prostate Cancer-2024 Update. Part I: Screening, Diagnosis, and Local Treatment with Curative Intent. Eur Urol. 2024;86(2):148-163.
5. Wei JT, Barocas D, Carlsson S, Coakley F, Eggener S, Etzioni R, et al. Early Detection of Prostate Cancer: AUA/SUO Guideline Part I: Prostate Cancer Screening. J Urol. Published online 2023. doi:10.1097/JU.0000000000003491
6. Kalavacherla S, Riviere P, Javier-DesLoges J, Banegas MP, McKay RR, Murphy JD, et al. Low-Value Prostate-Specific Antigen Screening in Older Males. JAMA Netw Open. 2023;6(4):e237504.
7. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, et al. The future landscape of large language models in medicine. Communications Medicine. 2023;3(1):1-8.
8. Wei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Chain-of-thought prompting elicits reasoning in large language models. In: Proceedings of the 36th International Conference on Neural Information Processing Systems. NIPS '22. Curran Associates Inc.; 2024:24824-24837.
9. Lim DYZ, Tan YB, Koh JTE, Tung JYM, Sng GGR, Tan DMY, et al. ChatGPT on guidelines: Providing contextual knowledge to GPT allows it to provide advice on appropriate colonoscopy intervals. J Gastroenterol Hepatol. 2024;39(1):81-106.
10. Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, et al. Development of a Liver Disease-Specific Large Language Model Chat Interface using Retrieval Augmented Generation. medRxiv. doi:10.1101/2023.11.10.23298364
11. Chen Z, Liang N, Zhang H, Li H, Yang Y, Zong X, et al. Review: Harnessing the power of clinical decision support systems: challenges and opportunities. Open Heart. 2023;10(2). doi:10.1136/openhrt-2023-002432
12. Blank L, Baxter S, Woods HB, Goyder E, Lee A, Payne N, et al. Referral interventions from primary to specialist care: a systematic review of international evidence. Br J Gen Pract. 2014;64(629):e765-e774.
13. Gilbert S, Kather JN, Hogan A. Augmented non-hallucinating large language models as medical information curators. NPJ Digital Medicine. 2024;7. doi:10.1038/s41746-024-01081-0
14. Drosdowsky A, Gough K. The Charlson Comorbidity Index: problems with use in epidemiological research. J Clin Epidemiol. 2022;148:174-177.
15. London AJ. Artificial Intelligence and Black-Box Medical Decisions: Accuracy versus Explainability. Hastings Cent Rep. 2019;49(1):15-21.
16. Jeong M, Sohn J, Sung M, Kang J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. Bioinformatics. 2024;40(Suppl 1):i119.
17. Miao J, Thongprayoon C, Suppadungsuk S, Garcia Valencia OA, Cheungpasitporn W. Integrating retrieval-augmented generation with large language models in nephrology: Advancing practical applications. Medicina. 2024;60(3):445.
18. Charlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. J Chronic Dis. 1987;40(5). doi:10.1016/0021-9681(87)90171-8
19. Death and Life Expectancy. Base. http://www.singstat.gov.sg/find-data/search-by-theme/population/death-and-life-expectancy/latest-data. Accessed September 30, 2024.

Additional Declarations: No competing interests reported.

Supplementary Files: SupplementaryMaterial1.docx
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-5760861","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Article","associatedPublications":[],"authors":[{"id":433297820,"identity":"d8f0a998-ca6d-4050-b30c-aef097fb701d","order_by":0,"name":"Joshua Yi Min Tung","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAA6UlEQVRIiWNgGAWjYBACxgYGNjj7AZDg4SNWiwQQMxuAtLDh1wAGcC1sEnAuPsDc3v7sMU8NQ53B7fZrlV9z7GTYGJgfPrqBz2E9Z8yNeY4xSBjcOVN2W3ZbMtBhbMbGOfi0zMhhkwYqkzC4kZN2W3IbM5DNwyaNX0v6M2mefxAtxZLb6onRkmAmzdsG0pJ+jPHjtsNEaOk5YyY5t09CcuaNHGZpxm3HediYCfjFEBhiEm++2fDz3Uh/+PHntmp7fvbmh4/xamkAU6AY4TFg5gGxmfEoBwF5BJP9AeMPAqpHwSgYBaNgZAIAICFBJchN6eoAAAAASUVORK5CYII=","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":true,"prefix":"","firstName":"Joshua","middleName":"Yi Min","lastName":"Tung","suffix":""},{"id":433297821,"identity":"09e9ec01-0dff-4e46-8f4c-1bbedc28c8e0","order_by":1,"name":"Quan Le","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Quan","middleName":"","lastName":"Le","suffix":""},{"id":433297822,"identity":"7d6ceed5-a4fa-41d3-b4c2-db1189569f74","order_by":2,"name":"Jinxuan Yao","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Jinxuan","middleName":"","lastName":"Yao","suffix":""},{"id":433297823,"identity":"486886dd-114c-4f44-9a19-8f861ed6d19e","order_by":3,"name":"Yifei Huang","email":"","orcid":"","institution":"Singapore General 
Hospital","correspondingAuthor":false,"prefix":"","firstName":"Yifei","middleName":"","lastName":"Huang","suffix":""},{"id":433297824,"identity":"e69c381f-468b-4a94-b3f9-173ece612eeb","order_by":4,"name":"Daniel Yan Zheng Lim","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Daniel","middleName":"Yan Zheng","lastName":"Lim","suffix":""},{"id":433297825,"identity":"6465d691-8a9e-476c-b30b-27ea5f23c25f","order_by":5,"name":"Gerald Gui Ren Sng","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Gerald","middleName":"Gui Ren","lastName":"Sng","suffix":""},{"id":433297826,"identity":"0d88fe06-5c91-4b45-b4df-f6ce7beb5231","order_by":6,"name":"Rachel Shu En Lau","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Rachel","middleName":"Shu En","lastName":"Lau","suffix":""},{"id":433297827,"identity":"da062876-df4b-4cc9-8ef2-61862fd35f33","order_by":7,"name":"Yu Guang Tan","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Yu","middleName":"Guang","lastName":"Tan","suffix":""},{"id":433297828,"identity":"1579047e-c2be-422a-8ab1-0b3a4215705e","order_by":8,"name":"Kenneth Chen","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Kenneth","middleName":"","lastName":"Chen","suffix":""},{"id":433297829,"identity":"380089fd-6e18-474f-bfb2-b0dd6c553e4f","order_by":9,"name":"Kae Jack Tay","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Kae","middleName":"Jack","lastName":"Tay","suffix":""},{"id":433297830,"identity":"2646418f-618e-4eb1-b7fb-9e35d75c7b52","order_by":10,"name":"Jen Hong Tan","email":"","orcid":"","institution":"Singapore General 
Hospital","correspondingAuthor":false,"prefix":"","firstName":"Jen","middleName":"Hong","lastName":"Tan","suffix":""},{"id":433297831,"identity":"319f0953-c86f-4d06-96f6-37ae05a70059","order_by":11,"name":"John Shyi-Peng Yuen","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"John","middleName":"Shyi-Peng","lastName":"Yuen","suffix":""},{"id":433297832,"identity":"7ce86adc-2459-40d9-8ee1-c107d291363b","order_by":12,"name":"Christopher Wai Sam Cheng","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Christopher","middleName":"Wai Sam","lastName":"Cheng","suffix":""},{"id":433297833,"identity":"3f20051d-2be1-47f9-867f-2a549ab2a973","order_by":13,"name":"Henry Sun Sien Ho","email":"","orcid":"","institution":"Singapore General Hospital","correspondingAuthor":false,"prefix":"","firstName":"Henry","middleName":"Sun Sien","lastName":"Ho","suffix":""}],"badges":[],"createdAt":"2025-01-04 01:38:12","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-5760861/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-5760861/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":79551827,"identity":"1c8c32f3-bec0-46bf-8c5c-6cc0c072d87e","added_by":"auto","created_at":"2025-03-31 06:42:21","extension":"png","order_by":1,"title":"Figure 1","display":"","copyAsset":false,"role":"figure","size":171902,"visible":true,"origin":"","legend":"\u003cp\u003eSchematic diagram for PSA Recommendation Pipeline\u003c/p\u003e","description":"","filename":"floatimage1.png","url":"https://assets-eu.researchsquare.com/files/rs-5760861/v1/d5dbb440478079737ef3ecc2.png"},{"id":79552455,"identity":"0025aad7-fcbc-4a54-b55c-63633d671d03","added_by":"auto","created_at":"2025-03-31 
06:50:26","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1093354,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-5760861/v1/76a5731b-ca4b-4b1e-84ce-93644445ab87.pdf"},{"id":79550573,"identity":"eb0cdd91-83ba-4c39-ab1a-a540ac51877c","added_by":"auto","created_at":"2025-03-31 06:34:21","extension":"docx","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":853470,"visible":true,"origin":"","legend":"","description":"","filename":"SupplementaryMaterial1.docx","url":"https://assets-eu.researchsquare.com/files/rs-5760861/v1/2443d2ce4be7e6117caa3b64.docx"}],"financialInterests":"No competing interests reported.","formattedTitle":"\u003cp\u003eRetrieval-Augmented-Generation large language models outperform junior clinicians in guideline-concordant PSA testing\u003c/p\u003e","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eProstate cancer is the second most commonly diagnosed cancer and the fifth leading cause of cancer-related death among men globally\u003csup\u003e1\u003c/sup\u003e. Screening for prostate cancer is thus a common issue in both primary and specialist care settings. Prostate-specific antigen (PSA) testing is the most widely used method for early detection, but remains a controversial issue in urological literature, largely owing to the harms associated with over-diagnosis and over-treatment\u003csup\u003e2,3\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003e Society guidelines for prostate cancer screening via PSA testing serve to streamline and standardize patient care, and are often utilized by trainees, junior staff, or non-specialist medical practitioners to guide medical decision-making. 
Such guidelines have been issued by various organizations such as the European Association of Urology (EAU)\u003csup\u003e4\u003c/sup\u003e and American Urological Association (AUA)\u003csup\u003e5\u003c/sup\u003e, but discrepancies between these guidelines \u0026ndash; such as recommendations on whether PSA screening should be offered, the appropriate patient populations, and screening intervals \u0026ndash; pose challenges for clinical decision making. These challenges are compounded by other patient factors, such as estimated life expectancy (as many guidelines do not recommend PSA screening in patients with a\u0026thinsp;\u0026lt;\u0026thinsp;10- or \u0026lt;\u0026thinsp;15-year life expectancy) and the patient\u0026rsquo;s own preferences. Shared decision making (SDM) forms a key component in both the EAU and AUA guidelines, particularly in older men, or those with multiple medical comorbidities.\u003c/p\u003e \u003cp\u003e The current EAU-EANM-ESTRO-ESUR-ISUP-SIOG and AUA/SUO guidelines on prostate cancer and early detection of prostate cancer stand at 239 and 47 pages respectively. Appropriate decision-making and adherence to guidelines is therefore a time-consuming and challenging task for non-specialists in a primary care setting, and even in the specialist-outpatient setting where time constraints are common. Prior studies have shown a low rate of compliance with organizational guidelines; for example, a cohort study of 32,306 men found that 40% of those in the \u0026gt;\u0026thinsp;80-year age group received inappropriate PSA screening\u003csup\u003e6\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003e One potential solution to this problem is to use AI to parse guidelines and deliver an appropriate recommendation. 
Large language models (LLMs) are a form of artificial intelligence (AI) trained on large amounts of text data, and hence have the capability to process unstructured text inputs and generate appropriate responses. They can thus be applied in healthcare, such as in patient communications, education, and clinical risk stratification\u003csup\u003e7\u003c/sup\u003e. However, general LLMs such as the GPT models developed by OpenAI are not specifically designed for healthcare use, and can produce inaccurate or misleading information. They also have a knowledge cut-off determined by the recency of the underlying training data, for example January 2022 for the current generation of OpenAI GPT models. To address these limitations, retrieval-augmented generation (RAG) techniques have been developed to enhance the accuracy of LLMs. RAG directs the LLM to answer a given scenario by reference to an additional database of curated information, such as a set of guidelines. By \u0026lsquo;grounding\u0026rsquo; the responses using relevant information from the database, LLMs can overcome their intrinsic knowledge cut-off, and produce responses with less hallucination.\u003c/p\u003e \u003cp\u003e The aim of this study was thus to evaluate the accuracy of a RAG-enabled LLM grounded in the EAU and AUA guidelines pertaining to prostate cancer screening.\u003c/p\u003e"},{"header":"2. Methods","content":"\u003cp\u003eThis study was conducted in a simulated environment using only fictional patient data. As the use of fictional data does not fall under local Human Biomedical Research Act regulations, ethical approval was not required.\u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003e2.1. Development of case scenarios\u003c/h2\u003e \u003cp\u003eA series of 44 fictional case scenarios was developed to reflect a range of clinical presentations in an outpatient clinic setting. 
These free text scenarios included fictional patient biodata such as age, medical comorbidities, presence or absence of urological symptoms (such as hematuria or lower urinary tract symptoms, if any), and prior PSA readings (if applicable). These were written by a urology fellow with 8 years of clinical experience, supervised by 2 urology consultants with \u0026gt;\u0026thinsp;20 years of clinical experience each.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e2.2. Development of RAG-enabled LLM model\u003c/h2\u003e \u003cp\u003eWe developed an automated pipeline to process case scenarios based on how a healthcare provider would provide a PSA testing recommendation. The schematic diagram is shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eKey components of this pipeline include:\u003c/p\u003e \u003cp\u003e\u003cul\u003e\u003cli\u003e\u003cp\u003eAn LLM-based calculator to extract relevant patient information (age and comorbidities) from the case scenario, in order to calculate the Charlson Comorbidity Index (CCI) and thereby estimate the expected 10-year life expectancy. Patients who are not expected to live at least 10 years are not recommended for PSA screening\u003csup\u003e4,5\u003c/sup\u003e, and the pipeline did not allow such case scenarios to proceed. Likewise, scenarios where the patient was \u0026gt;\u0026thinsp;72 years of age were also not permitted to proceed. We provide further technical details of the CCI calculator in \u003cb\u003eAppendix 1\u003c/b\u003e.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eFor patients with at least 10-year life expectancy based on CCI scores, a RAG-enabled LLM was used to provide a recommendation based on the given case scenario. 
In comparison with standard \u0026ldquo;off-the-shelf\u0026rdquo; LLMs, which are not trained on domain-specific medical information, RAG allows the LLM to reference a fixed set of material, such as the relevant EAU and AUA society guidelines in this study. Language models augmented in this way with contextualized information can overcome their intrinsic knowledge deficits, and reduce hallucination by constraining the response to the given information.\u003c/p\u003e\u003c/li\u003e\u003cli\u003e\u003cp\u003eSince the AUA \u0026amp; EAU guidelines occasionally provide different and non-overlapping recommendations, separate answers were first generated from each set of guidelines and then combined into the final recommendation.\u003c/p\u003e\u003c/li\u003e\u003c/ul\u003e\u003c/p\u003e \u003cp\u003eWe provide further technical details of the RAG-enabled LLM in \u003cb\u003eAppendix 2\u003c/b\u003e. These include explanations of modern RAG techniques applied to optimize performance, such as context filtering to improve retrieval of relevant information, and advanced prompting methods (chain-of-thought reasoning\u003csup\u003e8\u003c/sup\u003e, constraining answers to be based on retrieved information, providing an example output structure, and use of an expert clinician persona). The full RAG prompt can be found in \u003cb\u003eAppendix 3\u003c/b\u003e.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e2.3. Relevant Software\u003c/h2\u003e \u003cp\u003eThe RAG prototype was developed with Python 3.10. Vector databases were constructed using the Unstructured API for ingestion of PDF documents, the OpenAI API for generation of text embeddings, and Qdrant as the vector database. For LLM calls, we used both OpenAI and Anthropic APIs for different components in our pipeline. 
We used both LlamaIndex and LangChain for orchestration, with LlamaIndex handling the retrieval-augmented generation components, whereas LangChain was used for structured data extraction and connecting pipeline components.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec6\" class=\"Section2\"\u003e \u003ch2\u003e2.4. Answer Generation \u0026amp; Grading\u003c/h2\u003e \u003cp\u003eFive junior clinicians were tasked with providing recommendations on PSA testing for each of the case scenarios. They comprised a mix of medical officers, urology junior residents, and family medicine residents. Each clinician completed the task in a \u0026ldquo;closed book\u0026rdquo; format, followed by an \u0026ldquo;open book\u0026rdquo; format in which they were permitted to reference relevant material of their choice (guidelines, textbooks). The time taken to complete the task in each format was recorded.\u003c/p\u003e \u003cp\u003eThe RAG-LLM tool was provided with the same set of fictional case scenarios and instructed to provide recommendations on PSA testing. We conducted 5 runs to assess the consistency of the LLM output. Answers were graded by the study team in a binary format (correct/incorrect). Answers were marked as correct if they were concordant with either the EAU or AUA guidelines.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e2.5. Statistical Analysis\u003c/h2\u003e \u003cp\u003eSPSS version 26.0 (IBM Corp) was used for the statistical analysis of quantitative data. Answers from the RAG-LLM tool and human comparators were compared using Student\u0026rsquo;s \u003cem\u003et\u003c/em\u003e-test.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"3. 
Results","content":"\u003cp\u003eThe RAG-LLM tool provided guideline-concordant recommendations in 95.5% of case scenarios, whereas junior clinicians were correct in 62.3% of scenarios in the closed-book format and 74.1% of scenarios in the open-book format. The difference was statistically significant for both closed-book (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and open-book (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001) formats.\u003c/p\u003e \u003cp\u003eCases were divided into screening (20 cases) and follow-up (24 cases) categories. The RAG-LLM tool provided an incorrect recommendation in 1 screening case, where in all 5 runs it failed to offer a PSA test in a patient in whom screening was recommended. In comparison, junior clinicians missed 16 tests in the closed-book format and 11 in the open-book format. They also offered 14 unnecessary PSA tests in the closed-book format and 10 in the open-book format. For follow-up cases, the RAG-LLM tool provided an incorrect recommendation in 1 case, where in all 5 runs it mistakenly offered a repeat PSA test in a patient with a normal PSA reading. In comparison, junior clinicians ordered 29 unnecessary tests in the closed-book format and 23 in the open-book format, and missed 24 tests and 13 tests in the closed- and open-book formats respectively. 
Overall, the RAG-LLM tool recommended 71 fewer unnecessary PSA tests than junior clinicians, and missed 59 fewer PSA tests which should have been offered.\u003c/p\u003e \u003cp\u003eResults were further analyzed by the following categories of cases: (1) PSA screening recommended, (2) PSA screening not recommended, (3) Follow-up of a normal PSA reading, (4) Management/follow-up of an elevated PSA reading, and (5) Others, including likely spuriously elevated PSA readings from concurrent urinary tract infections, elevated PSA readings in patients with significant co-morbidity in whom further/repeat testing would be unlikely to be beneficial, and normal PSA readings in patients with an abnormal digital rectal examination. Results are detailed in Table\u0026nbsp;\u003cspan refid=\"Tab1\" class=\"InternalRef\"\u003e1\u003c/span\u003e.\u003c/p\u003e \u003cp\u003eAverage time taken by clinicians to provide a recommendation was 23 seconds in the closed-book format and 28 seconds in an open-book format. 
In comparison, the RAG-LLM tool averaged 9.7 seconds per recommendation.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eResults and breakdown of error categories\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"9\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c9\" colnum=\"9\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c4\" namest=\"c2\"\u003e \u003cp\u003eUnnecessary Tests\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colspan=\"3\" nameend=\"c7\" namest=\"c5\"\u003e \u003cp\u003eMissed tests\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003eTotal Errors\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c9\" morerows=\"1\" rowspan=\"2\"\u003e \u003cp\u003e\u003cem\u003ep\u003c/em\u003e\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003ctr\u003e \u003cth 
align=\"left\" colname=\"c1\"\u003e\u0026nbsp;\u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eShort interval\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eDid not require\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eSubtotal\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eLong interval\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eFailed to offer\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eSubtotal\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLLM\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e10 (4.5%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, closed-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e32\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e 
\u003cp\u003e43\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e26\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e40\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e83 (37.7%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, open-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e23\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e33\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e57 (25.9%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCategory 1: PSA screening recommended (11 cases)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLLM\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e5 (9.1%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, closed-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e13\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e13 (23.6%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.040\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, open-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c6\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e11\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e11 (20.0%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.107\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCategory 2: PSA screening not recommended (9 cases)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLLM\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0 (0%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, closed-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e17 (37.8%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, open-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e10 (22.2%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCategory 3: Normal PSA follow-up (9 cases)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLLM\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e5\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e 
\u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e5 (11.1%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, closed-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e16\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e19 (42.2%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, open-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e7\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e14\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e1\u003c/p\u003e 
\u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e15 (33.3%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e0.011\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCategory 4: Elevated PSA (8 cases)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLLM\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0 (0%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, closed-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e21\u003c/p\u003e \u003c/td\u003e 
\u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e21 (52.5%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, open-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e12\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e13 (32.5%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colspan=\"9\" nameend=\"c9\" namest=\"c1\"\u003e \u003cp\u003e\u003cb\u003eCategory 5: Others (7 cases)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eLLM\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e 
\u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e0 (0%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e-\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, closed-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e10\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e13\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e13 (37.1%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;0.001\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman, open-book\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e6\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e8\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c7\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e8 (22.9%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c9\"\u003e 
\u003cp\u003e0.002\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e"},{"header":"4. Discussion","content":"\u003cp\u003eTo our knowledge, this is the first study in the field of Urology demonstrating the efficacy of a retrieval-augmented generation LLM tool for clinical decision support. Augmenting large language models with contextualized information has been shown in other healthcare domains to reduce instances of hallucination and increase accuracy\u003csup\u003e9,10\u003c/sup\u003e. In this study, guideline concordant recommendations were made in \u0026gt;\u0026thinsp;95% of scenarios by the RAG-LLM, as compared to 60\u0026ndash;75% concordance by junior clinicians.\u003c/p\u003e \u003cp\u003e Examining responses which were not guideline concordant, we found that the errors made by the RAG-LLM arose from (1) the rule-based nature of the CCI calculator, which precluded a 72-year-old from PSA screening despite strong risk factors for prostate cancer, (2) erroneous interpretation of a normal PSA result as \u0026ldquo;moderately elevated\u0026rdquo;, triggering a reactive repeat PSA test which in actuality was unnecessary. In contrast, the junior clinicians made errors across a broad range of categories, irrespective of seniority or training status.\u003c/p\u003e \u003cp\u003e In case scenarios where EAU and AUA guidelines provided differing recommendations for PSA testing intervals, the RAG-LLM tool provided both recommendations. In comparison, junior clinicians generally selected a single guideline document as a reference. While not incorrect, their responses were thus qualitatively less comprehensive and thorough than those generated by the LLM tool.\u003c/p\u003e \u003cp\u003e Our study demonstrates that RAG-LLM tools have the potential to augment clinical decision-making by providing guideline-concordant recommendations in real-time. 
While such a clinical task may be relatively simple for an experienced specialist, generalists or junior clinicians may not have similar familiarity or experience with specialist care. Such clinical decision support tools may prove useful in primary care settings, or in care settings where it is practically challenging for a senior clinician to supervise every clinical decision when time pressure and patient volume supervene. Patient-specific, guideline-based tools can potentially relieve cognitive burden, smooth out learning curves, and shorten decision-making time, thereby improving the overall consistency and efficiency of clinic consultations\u003csup\u003e11\u003c/sup\u003e. Use of RAG-LLM tools as a method to improve guideline adherence can also be a strategy to minimize unnecessary investigations and specialist consultation, thereby reducing costs to patients and public healthcare systems. In the primary care setting, increased adherence to guidelines has been shown to improve the quality and appropriateness of specialist referrals\u003csup\u003e12\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eFrom a technical standpoint, RAG-LLM tools are preferable to \u0026ldquo;off-the-shelf\u0026rdquo; LLMs. The use of LLMs in clinical medicine engenders concerns of hallucination and resulting inaccurate recommendations, with implications for patient care and safety. Incorporating RAG systems in LLM tools reduces the frequency of hallucinations\u003csup\u003e13\u003c/sup\u003e and is more economical than fine-tuning or pre-training a model from the ground up.\u003c/p\u003e \u003cp\u003eWe acknowledge some important limitations to this study, which fall into the clinical and technical domains. From a clinical perspective, this study uses fictional case scenarios, rather than real clinical cases. 
However, it is arguably better to perform LLM evaluation on a well curated set of varied case scenarios, rather than a sample from a general population which would be less likely to feature uncommon or complex cases. This is analogous to assessment of junior clinicians, where ability would be assessed using a purposefully designed set of cases, rather than a general sample of common cases.\u003c/p\u003e \u003cp\u003eA further clinical limitation is the use of the Charlson Comorbidity Index as a tool to estimate 10-year life expectancy. Although the CCI is recommended in the EAU guidelines as a means of estimating life expectancy, it was created in 1987 and in the modern day has certain limitations - such as an incomplete list of comorbidities, assumptions that the effect of comorbidities is additive, and potentially lengthier disease prognoses with modern medical management\u003csup\u003e14\u003c/sup\u003e.\u003c/p\u003e \u003cp\u003eFrom a technical perspective, although supplementing LLMs with RAG has been shown to reduce rates of AI hallucinations\u003csup\u003e13\u003c/sup\u003e, these models are not altogether immune to hallucination. In addition, our RAG-LLM tool provided incorrect recommendations in a small minority of scenarios, but the reasons for it doing so are unclear, highlighting the \u0026lsquo;black box\u0026rsquo; nature of many AI- or AI-assisted tools\u003csup\u003e15\u003c/sup\u003e. Use of techniques such as prompt engineering and self-reflective RAG models may help to enhance the accuracy of these models\u003csup\u003e16\u003c/sup\u003e. Variability in performance across different LLMs also needs to be taken into account, and balanced against the cost of each model.\u003c/p\u003e \u003cp\u003eDespite these limitations, RAG-LLM tools retain potential for multiple applications in healthcare. 
The same system that provides guideline-based clinical decision support can also be used retrospectively as an auditing tool, to identify areas of guideline discordance in clinical practice. Furthermore, the RAG approach allows future guideline documents to be incorporated much more easily than a fine-tuning or pre-training approach, keeping the tool up to date and preventing obsolescence\u003csup\u003e17\u003c/sup\u003e. Expanding such RAG-LLM pipelines beyond PSA testing to other areas in Urology and the eventual development of a comprehensive clinical decision support system are potential areas for further research.\u003c/p\u003e"},{"header":"5. Conclusion","content":"\u003cp\u003e RAG-LLM tools can be used to create accurate and reliable real-time clinical decision support systems, outperforming junior clinicians in making efficient and guideline-concordant decisions. The use of these tools can help increase guideline adherence, improving patient care and optimizing the utilization of healthcare resources. 
Although the use of these tools comes with challenges, such as access to real-world datasets for machine learning and susceptibility to AI hallucination, RAG-LLM tools hold great potential for improving patient-specific, evidence-based medicine.\u003c/p\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eConflicts of Interest\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors have no conflicts of interest to declare.\u003c/p\u003e\n\u003cp\u003eFunding: This study was supported by an academic medicine philanthropic fund (the Foo Keong Tatt Professorship in Urology) from the Singapore Health Services Duke-National University of Singapore (\u0026quot;SingHealth Duke-NUS\u0026quot;) Joint Office of Academic Medicine.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003e\u0026nbsp;Acknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was supported by an academic medicine philanthropic fund (the Foo Keong Tatt Professorship in Urology) from the Singapore Health Services Duke-National University of Singapore (\u0026quot;SingHealth Duke-NUS\u0026quot;) Joint Office of Academic Medicine.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability Statement\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n\u003cli\u003eBray F, Laversanne M, Sung H, Ferlay J, Siegel RL, Soerjomataram I, et al. Global cancer statistics 2022: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. \u003cem\u003eCA Cancer J Clin\u003c/em\u003e. 2024;74(3):229-263.\u003c/li\u003e\n\u003cli\u003eEtzioni R, Penson DF, Legler JM, di Tommaso D, Boer R, Gann PH, et al. Overdiagnosis due to prostate-specific antigen screening: lessons from U.S. prostate cancer incidence trends. 
\u003cem\u003eJ Natl Cancer Inst\u003c/em\u003e. 2002;94(13):981-990.\u003c/li\u003e\n\u003cli\u003ePinsky PF, Parnes HL, Andriole G. Mortality and complications after prostate biopsy in the Prostate, Lung, Colorectal and Ovarian Cancer Screening (PLCO) trial. \u003cem\u003eBJU Int\u003c/em\u003e. 2014;113(2):254-259.\u003c/li\u003e\n\u003cli\u003eCornford P, van den Bergh RCN, Briers E, Van den Broeck T, Brunckhorst O, Darraugh J, et al. EAU-EANM-ESTRO-ESUR-ISUP-SIOG Guidelines on Prostate Cancer-2024 Update. Part I: Screening, Diagnosis, and Local Treatment with Curative Intent. \u003cem\u003eEur Urol\u003c/em\u003e. 2024;86(2):148-163.\u003c/li\u003e\n\u003cli\u003eWei JT, Barocas D, Carlsson S, Coakley F, Eggener S, Etzioni R, et al. Early Detection of Prostate Cancer: AUA/SUO Guideline Part I: Prostate Cancer Screening. \u003cem\u003eJ Urol\u003c/em\u003e. Published online 2023. doi:10.1097/JU.0000000000003491\u003c/li\u003e\n\u003cli\u003eKalavacherla S, Riviere P, Javier-DesLoges J, Banegas MP, McKay RR, Murphy JD, et al. Low-Value Prostate-Specific Antigen Screening in Older Males. \u003cem\u003eJAMA Netw Open\u003c/em\u003e. 2023;6(4):e237504.\u003c/li\u003e\n\u003cli\u003eClusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt JN, Laleh NG, et al. The future landscape of large language models in medicine. \u003cem\u003eCommunications Medicine\u003c/em\u003e. 2023;3(1):1-8.\u003c/li\u003e\n\u003cli\u003eWei J, Wang X, Schuurmans D, Bosma M, Ichter B, Xia F, et al. Chain-of-thought prompting elicits reasoning in large language models. In: \u003cem\u003eProceedings of the 36th International Conference on Neural Information Processing Systems\u003c/em\u003e. NIPS \u0026rsquo;22. Curran Associates Inc.; 2024:24824-24837.\u003c/li\u003e\n\u003cli\u003eLim DYZ, Tan YB, Koh JTE, Tung JYM, Sng GGR, Tan DMY, et al. ChatGPT on guidelines: Providing contextual knowledge to GPT allows it to provide advice on appropriate colonoscopy intervals. 
\u003cem\u003eJ Gastroenterol Hepatol\u003c/em\u003e. 2024;39(1):81-106.\u003c/li\u003e\n\u003cli\u003eGe J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, et al. Development of a Liver Disease-Specific Large Language Model Chat Interface using Retrieval Augmented Generation. \u003cem\u003emedRxiv\u003c/em\u003e. doi:10.1101/2023.11.10.23298364\u003c/li\u003e\n\u003cli\u003eChen Z, Liang N, Zhang H, Li H, Yang Y, Zong X, et al. Review: Harnessing the power of clinical decision support systems: challenges and opportunities. \u003cem\u003eOpen Heart\u003c/em\u003e. 2023;10(2). doi:10.1136/openhrt-2023-002432\u003c/li\u003e\n\u003cli\u003eBlank L, Baxter S, Woods HB, Goyder E, Lee A, Payne N, et al. Referral interventions from primary to specialist care: a systematic review of international evidence. \u003cem\u003eBr J Gen Pract\u003c/em\u003e. 2014;64(629):e765-e774.\u003c/li\u003e\n\u003cli\u003eGilbert S, Kather JN, Hogan A. Augmented non-hallucinating large language models as medical information curators. \u003cem\u003eNPJ Digital Medicine\u003c/em\u003e. 2024;7. doi:10.1038/s41746-024-01081-0\u003c/li\u003e\n\u003cli\u003eDrosdowsky A, Gough K. The Charlson Comorbidity Index: problems with use in epidemiological research. \u003cem\u003eJ Clin Epidemiol\u003c/em\u003e. 2022;148:174-177.\u003c/li\u003e\n\u003cli\u003eLondon AJ. Artificial Intelligence and Black-Box Medical Decisions: Accuracy versus Explainability. \u003cem\u003eHastings Cent Rep\u003c/em\u003e. 2019;49(1):15-21.\u003c/li\u003e\n\u003cli\u003eJeong M, Sohn J, Sung M, Kang J. Improving medical reasoning through retrieval and self-reflection with retrieval-augmented large language models. \u003cem\u003eBioinformatics\u003c/em\u003e. 2024;40(Suppl 1):i119.\u003c/li\u003e\n\u003cli\u003eMiao J, Thongprayoon C, Suppadungsuk S, Garcia Valencia OA, Cheungpasitporn W. Integrating retrieval-augmented generation with large language models in nephrology: Advancing practical applications. 
\u003cem\u003eMedicina \u003c/em\u003e. 2024;60(3):445.\u003c/li\u003e\n\u003cli\u003eCharlson ME, Pompei P, Ales KL, MacKenzie CR. A new method of classifying prognostic comorbidity in longitudinal studies: development and validation. \u003cem\u003eJ Chronic Dis\u003c/em\u003e. 1987;40(5). doi:10.1016/0021-9681(87)90171-8\u003c/li\u003e\n\u003cli\u003eDeath and Life Expectancy. Base. http://www.singstat.gov.sg/find-data/search-by-theme/population/death-and-life-expectancy/latest-data. Accessed September 30, 2024\u003c/li\u003e\n\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"Artificial intelligence, large language models, retrieval augmented generation","lastPublishedDoi":"10.21203/rs.3.rs-5760861/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-5760861/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003ch2\u003eBackground and Objective\u003c/h2\u003e \u003cp\u003e Society guidelines for prostate cancer screening via PSA testing serve to standardize patient care, and are often utilized by trainees, junior staff, or generalist medical practitioners to guide medical 
decision-making. Adherence to guidelines is a time-consuming and challenging task and rates of inappropriate PSA testing are high. This study evaluates a retrieval-augmented generation (RAG) enhanced large language model (LLM), grounded in current EAU and AUA guidelines, to assess its effectiveness in providing guideline-concordant PSA screening recommendations compared to junior clinicians.\u003c/p\u003e\u003ch2\u003eMethods\u003c/h2\u003e \u003cp\u003eA retrieval-augmented generation (RAG) pipeline was developed and used to process a series of 44 fictional case scenarios. Five junior clinicians were tasked to provide PSA testing recommendations for the same scenarios, in closed-book and open-book formats. Answers were compared for accuracy in a binomial fashion.\u003c/p\u003e\u003ch2\u003eKey Findings and Limitations\u003c/h2\u003e \u003cp\u003e The RAG-LLM tool provided guideline-concordant recommendations in 95.5% of case scenarios, compared to junior clinicians, who were correct in 62.3% of scenarios in a closed-book format, and 74.1% of scenarios in an open book format. The difference was statistically significant for both closed-book (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001) and open-book (\u003cem\u003ep\u003c/em\u003e\u0026thinsp;\u0026lt;\u0026thinsp;0.001) formats.\u003c/p\u003e\u003ch2\u003eConclusions and Clinical Implications\u003c/h2\u003e \u003cp\u003e Use of RAG techniques allows LLMs to integrate complex guidelines into day-to-day medical decision-making. 
RAG-LLM tools in Urology have the capability to enhance clinical decision-making by providing guideline-concordant recommendations for PSA testing, potentially improving the consistency of healthcare delivery, reducing cognitive load on clinicians, and reducing unnecessary investigations and costs.\u003c/p\u003e","manuscriptTitle":"Retrieval-Augmented-Generation large language models outperform junior clinicians in guideline-concordant PSA testing","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-03-31 06:34:16","doi":"10.21203/rs.3.rs-5760861/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"9623ffe6-8185-4f31-91ba-0f0559ef1190","owner":[],"postedDate":"March 31st, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":46227409,"name":"Health sciences/Urology"},{"id":46227410,"name":"Health sciences/Medical research"}],"tags":[],"updatedAt":"2025-03-31T06:34:16+00:00","versionOfRecord":[],"versionCreatedAt":"2025-03-31 06:34:16","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-5760861","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-5760861","identity":"rs-5760861","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
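
The abstract above states that answers were "compared for accuracy in a binomial fashion" but does not name the statistical test. As a rough illustration only, the sketch below compares the reported proportions with a two-sided Fisher's exact test, which is an assumption and not necessarily the authors' method. The counts are reconstructed from the reported percentages under the further assumption that the five clinicians' answers to the 44 scenarios were pooled into 220 responses (42/44 for the RAG-LLM, 137/220 closed-book, 163/220 open-book). The function name `fisher_exact_two_sided` is hypothetical, and pooling ignores any clustering of answers within a clinician.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of every table with the same
    margins that is no more likely than the observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, with margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Small relative tolerance so ties with the observed table are included.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Counts reconstructed from the reported percentages (assumed, see lead-in).
LLM_CORRECT, LLM_TOTAL = 42, 44            # 95.5% of 44 scenarios
CLOSED_CORRECT, HUMAN_TOTAL = 137, 220     # 62.3% of 5 x 44 responses
OPEN_CORRECT = 163                         # 74.1% of 5 x 44 responses

p_closed = fisher_exact_two_sided(
    LLM_CORRECT, LLM_TOTAL - LLM_CORRECT,
    CLOSED_CORRECT, HUMAN_TOTAL - CLOSED_CORRECT,
)
p_open = fisher_exact_two_sided(
    LLM_CORRECT, LLM_TOTAL - LLM_CORRECT,
    OPEN_CORRECT, HUMAN_TOTAL - OPEN_CORRECT,
)

if __name__ == "__main__":
    print(f"RAG-LLM vs closed-book clinicians: p = {p_closed:.2g}")
    print(f"RAG-LLM vs open-book clinicians:   p = {p_open:.2g}")
```

Because the test and the pooling are assumed, the p-values this produces need not match the thresholds reported in the paper; a clinician-level analysis, or the test the authors actually used, could give different figures.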



License: CC-BY-4.0