Causal-RAG: Causally-Augmented Retrieval for Hallucination-Free Clinical Decision Support in Low-Resource Settings

doi:10.21203/rs.3.rs-8163345/v1

2025 · doi:10.21203/rs.3.rs-8163345/v1

preprint OA: closed

Research Square · Research Article

Nnaemeka Kingsley Ugwumba, Peter Sunday Jaja

This is a preprint; it has not been peer reviewed by a journal.
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

This research addresses a critical challenge in using artificial intelligence for healthcare in low-resource settings: the tendency of AI models to produce confident but incorrect information, a phenomenon known as hallucination. We propose Causal-RAG, a novel framework that enhances standard Retrieval-Augmented Generation (RAG) by integrating principles of causal inference. The goal is to ground the AI's responses in robust, causally-relevant evidence rather than mere correlations.
We built a prototype and tested it on a clinical question-answering task. Our findings reveal a fundamental trade-off: a standard RAG system achieved high accuracy but displayed a dangerous 'yes' bias, whereas our Causal-RAG approach reduced this overconfidence, prioritizing safety. This work establishes a foundation for developing more trustworthy and reliable AI decision-support tools for clinical environments where data and expertise are scarce.

Artificial Intelligence and Machine Learning

Keywords: clinical artificial intelligence, retrieval-augmented generation, causal inference, medical natural language processing, AI safety, healthcare AI, medical decision support, low-resource clinical settings, hallucination reduction, evidence-based AI, confidence calibration, clinical question answering

1. Introduction

The adoption of Artificial Intelligence (AI) in clinical decision support holds immense promise for improving healthcare delivery, especially in low-resource settings where expert medical personnel are scarce. Large Language Models (LLMs) can potentially bridge this gap by providing instant access to medical knowledge. However, their practical application is severely hampered by a well-documented tendency to "hallucinate": that is, to generate plausible-sounding but factually incorrect or unsupported information. This is not merely an academic concern; in a clinical context, an overconfident and incorrect AI recommendation can directly impact patient safety and outcomes.

Retrieval-Augmented Generation (RAG) has emerged as a primary technique to mitigate hallucinations by grounding an LLM's responses in retrieved evidence from external knowledge bases, such as medical textbooks or research papers. While effective, standard RAG systems have a critical weakness: they retrieve information based on semantic similarity but lack the ability to judge the quality or nature of the evidence they find.
A text describing a spurious correlation can be retrieved with a high similarity score, leading the LLM to generate a confident but causally unfounded conclusion. This is analogous to a medical student who can recall facts but cannot critically appraise a clinical study.

In this research, we posit that the key to building more trustworthy clinical AI lies in moving beyond semantic retrieval to evidence-aware retrieval. We specifically focus on causal evidence. In medicine, establishing causality (for instance, that a drug causes an improvement in symptoms, not just that the two are associated) is the gold standard for evidence-based practice.

To this end, we introduce Causal-RAG, a novel framework designed to enhance the reliability of clinical AI for low-resource settings. The core idea is to augment the standard RAG retrieval process with a causal relevance layer. This layer prioritizes documents that contain language indicative of causal relationships (e.g., from randomized controlled trials) over those that merely report correlations.

In this paper, we detail the implementation of a Causal-RAG prototype. We began by establishing a strong baseline using a standard RAG system on a dataset of clinical questions. We then developed and iterated on several causal augmentation strategies. Our results are revealing: the baseline RAG system, while achieving high nominal accuracy, exhibited a near-total "yes" bias, answering affirmatively to almost every question. This demonstrates a critical form of overconfidence. Our subsequent Causal-RAG implementations successfully broke this bias, making the system more conservative and evidence-based, albeit at an initial cost to accuracy. This trade-off between accuracy and safety is a central finding of our work, highlighting a crucial consideration for the deployment of AI in real-world clinical environments.

The main goal of this research is to build and test this Causal-RAG concept.
Our specific objectives are:

a. To build a standard RAG system for clinical questions and identify its specific weaknesses, especially its tendency to be overconfident.
b. To design and implement a new retrieval method that can find and prioritize causally-sound medical evidence.
c. To test our Causal-RAG system against the standard one, measuring not only which is more accurate but, more importantly, which is safer and less prone to dangerous hallucinations.
d. To analyze the trade-off between being highly confident and being correct, providing a foundation for future, more reliable clinical AI tools.

This approach aims to create AI that is not just smart, but also trustworthy and safe enough to be used in real-world clinical environments where the margin for error is zero.

2. Related Work

This research sits at the intersection of clinical natural language processing, retrieval-augmented generation, and causal inference. A thorough review of these domains is essential to contextualize our contribution.

2.1 Clinical Language Models and Their Limitations

The development of domain-specific language models has been a cornerstone of medical AI research. BioBERT (Lee et al., 2019) was a landmark achievement, demonstrating that pre-training BERT on large-scale biomedical corpora (PubMed abstracts and PMC full-text articles) significantly improved performance on tasks like named entity recognition and relation extraction. ClinicalBERT (Alsentzer et al., 2019) extended this approach by focusing on clinical notes from the MIMIC-III database, tailoring the model to the specific language and abbreviations used in electronic health records. These models represented a substantial leap over general-purpose LLMs in understanding medical terminology.

More recently, the field has witnessed the emergence of generative clinical models. Med-PaLM (Singhal et al., 2022) was one of the first LLMs to reach expert-level performance on U.S.
Medical Licensing Examination (USMLE)-style questions, showcasing remarkable medical knowledge recall. Its successor, Med-PaLM 2 (Singhal et al., 2023), further pushed the boundaries, emphasizing the model's ability to provide reasoning and align with medical consensus.

However, a critical and persistent limitation across all these models, explicitly acknowledged by their creators, is the propensity for hallucination. The Med-PaLM 2 paper notes that while their model achieved high accuracy, a significant portion of its remaining errors were "hallucinations or non-factual statements." This is not a problem unique to these models; it is a fundamental issue with the auto-regressive nature of LLMs, which are trained to generate plausible text sequences rather than to be factually correct. In a low-resource clinical setting, where a clinician may lack the expertise to fact-check the AI, this limitation becomes a critical safety hazard. Our work directly addresses this by building a safeguard against such confabulation.

Beyond clinical applications, the integration of multiple data modalities for enhanced detection has shown success in other security-critical domains. For instance, Ugwumba (2025) proposed SocialGuard, a framework for fake account detection that combines behavioral API monitoring with malicious URL profiling. Their approach of extracting hybrid features from user activity and external patterns, achieving 99.51% accuracy, demonstrates the power of multi-modal feature integration for complex classification tasks. While our domain differs, this work inspires our approach of combining semantic retrieval with causal evidence patterns to create a more robust detection system for clinical hallucinations.

2.2 Retrieval-Augmented Generation (RAG) in Medicine

Retrieval-Augmented Generation was introduced by Lewis et al. (2020) as a general framework to mitigate hallucination and knowledge-update issues in LLMs by grounding generation in retrieved evidence.
The paradigm involves a retriever module (often a dense passage retriever like DPR) that fetches relevant documents from a knowledge corpus, and a generator module (a seq2seq model) that produces an answer conditioned on both the query and the retrieved passages.

The application of RAG to the medical domain has been an active area of research. For instance, Agrawal et al. (2022) explored RAG for open-domain medical question answering, using a corpus of medical textbooks and research articles. Their work demonstrated that RAG could significantly improve the factuality of responses compared to a generative-only baseline. Similarly, the authors of the PubMedQA dataset (Jin et al., 2019) implicitly encourage a retrieval-based approach, as the answers to its questions are derived from specific PubMed abstracts.

The application of advanced machine learning techniques to optimize complex decision-making processes has shown promise across various domains. For instance, Ugwumba & Jaja (2025) demonstrated the effectiveness of Deep Q-Networks for dynamic task prioritization, achieving 92.3% accuracy in optimizing scheduling decisions based on multiple constraints. This reinforcement learning approach to sequential decision-making under constraints shares conceptual parallels with our goal of optimizing evidence retrieval strategies in clinical settings, where multiple factors (semantic relevance, causal strength, evidence quality) must be balanced to make optimal retrieval decisions.

Despite these advances, a significant gap remains. Standard RAG systems, as applied in these studies, optimize for semantic similarity between the query and the knowledge base. The retriever is trained to find texts that are topically relevant, but it is agnostic to the epistemic quality of the evidence.
It cannot distinguish between a document describing a robust, randomized controlled trial (RCT) that establishes a causal link and a document describing a poorly controlled observational study that merely notes a correlation. Consequently, the generator can be provided with weak or misleading evidence, leading it to produce a confident but scientifically unfounded conclusion. Our Causal-RAG framework proposes a fundamental shift from semantic retrieval to evidence-quality-aware retrieval.

2.3 Causal Inference and its Integration with NLP

Causal inference provides a formal framework for reasoning about cause-and-effect relationships, typically through tools such as Structural Causal Models (SCMs) and the do-calculus (Pearl, 2009). The integration of causal reasoning into machine learning has been explored to improve model robustness, fairness, and generalizability.

In NLP, initial forays into causality have often focused on using causal graphs to de-bias models. For example, Feder et al. (2021) discussed how causal reasoning can help address spurious correlations in language models. Other work has focused on causal mediation analysis to understand the internal mechanisms of models (Vig et al., 2020). However, the application of causal inference to the retrieval component of an NLP system is far less explored.

Within medical AI, some research has incorporated causal knowledge. This has typically been achieved by building structured causal knowledge graphs. For example, a system might be built upon the UMLS Metathesaurus, with manually defined causal relationships between entities. While valuable, this approach is inherently limited by the scope and scale of the pre-defined graph. It cannot handle novel relationships or findings reported in the latest literature that have not yet been codified into a knowledge base. It is a top-down, rigid approach.

Our work diverges significantly.
Instead of relying on a pre-compiled causal graph, we take a bottom-up, data-driven approach. We aim to teach the retriever to recognize the linguistic signatures of causal evidence directly within unstructured text. This involves identifying phrases, study designs, and contextual cues that are hallmarks of high-quality, causal medical research (e.g., "double-blind randomized controlled trial," "was associated with a significant reduction in," "causal mechanism"). This allows our system to be more agile and applicable to the ever-evolving body of medical literature, making it particularly suitable for low-resource settings that may rely on a diverse and updated set of informational sources.

To our knowledge, this is the first work to propose and prototype a causally-augmented retrieval mechanism specifically designed to enhance the safety and trustworthiness of clinical RAG systems by prioritizing epistemically sound evidence.

3. Methodology

This section details our systematic approach to developing and evaluating the Causal-RAG framework. We began by establishing a robust baseline, then designed and iterated on our causal augmentation strategy, followed by a comprehensive evaluation protocol.

3.1 Dataset and Experimental Setup

For this study, we utilized a curated collection of medical instruction-following datasets. The primary dataset for our core experiments was the PubMedQA instruction dataset, comprising 2,112 question-answer pairs derived from PubMed research abstracts. Each data point is structured with an instruction (e.g., "As an expert doctor..."), an input containing the clinical question in a "Question: ...\nAnswer:" format, and an output which is the ground truth answer (typically 'yes', 'no', or 'maybe'). This dataset was chosen for its focus on evidence-based answers, where the response is explicitly anchored to a specific research abstract, making it ideal for evaluating a retrieval-based system.
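
As a rough illustration of the record layout described above, the following Python sketch parses one instruction-style record. The field names `instruction`, `input`, and `output` follow the description in the text; the dictionary representation, the helper names `extract_question` and `normalize_label`, and the rule of reading the label from the first word of the output are assumptions made for illustration, not the authors' code.

```python
from typing import Dict

VALID_LABELS = {"yes", "no", "maybe"}


def extract_question(input_text: str) -> str:
    r"""Pull the question out of a 'Question: ...\nAnswer:' formatted string."""
    body = input_text.strip()
    if body.startswith("Question:"):
        body = body[len("Question:"):]
    # Drop the trailing 'Answer:' prompt if it is present.
    if body.rstrip().endswith("Answer:"):
        body = body.rstrip()[: -len("Answer:")]
    return body.strip()


def normalize_label(output_text: str) -> str:
    """Map a ground-truth answer onto yes / no / maybe.

    Assumption: the label is the first word of the output field.
    Anything else is reported as 'unknown' rather than guessed.
    """
    words = output_text.strip().lower().split()
    first = words[0].strip(".,;:!") if words else ""
    return first if first in VALID_LABELS else "unknown"


# Hypothetical record; the question and its label are quoted from Section 4.2.
record: Dict[str, str] = {
    "instruction": "As an expert doctor...",
    "input": ("Question: Are weekend days required to accurately measure "
              "oral intake in hospitalised patients?\nAnswer:"),
    "output": "no",
}

question = extract_question(record["input"])
label = normalize_label(record["output"])
```

Returning 'unknown' for unexpected outputs keeps malformed records visible instead of silently folding them into one of the three classes.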
To simulate a knowledge base for our RAG system, we generated synthetic evidence passages for each question. While this is a limitation (addressed in Section 5), it provided a controlled environment for initial prototyping. Each synthetic evidence entry followed the format: "Medical research indicates that: [Question Phrasing]. This has been studied in clinical trials."

All experiments were conducted in a Kaggle kernel environment, utilizing Python with key libraries including Transformers, Sentence-Transformers, FAISS, and pandas.

3.2 Baseline RAG System Implementation

Our baseline system was designed to replicate a standard, semantically-driven RAG pipeline.

Retriever: We employed a pre-trained Sentence-BERT model (all-MiniLM-L6-v2) to encode both the questions and the synthetic knowledge base passages into a 384-dimensional dense vector space. These embeddings were indexed using FAISS (Facebook AI Similarity Search) with an inner-product index, which is equivalent to cosine similarity when the embeddings are L2-normalized, for efficient nearest-neighbor search.

Generator: Given the computational constraints of the environment and our focus on the retrieval component, we implemented a rule-based generator for the baseline. For a given query, the top k = 3 most semantically similar passages were retrieved. The generator then analyzed these passages for simple keyword patterns (e.g., presence of 'yes', 'confirm' for a positive answer; 'no', 'not' for a negative answer). If positive indicators dominated, it returned 'yes'; if negative, 'no'; otherwise, 'maybe'. This provided a clear, interpretable baseline focused on the quality of the retrieved context.

This baseline achieved 90% accuracy on a 10-sample test set. However, a critical failure mode was immediately apparent: it exhibited a 100% "yes" bias, meaning it answered 'yes' to every single question.
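
The shape of the baseline pipeline described above (synthetic passages, top-k inner-product retrieval, keyword-vote generator) can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the toy hashed bag-of-words `embed` function stands in for the all-MiniLM-L6-v2 encoder, the brute-force search stands in for the FAISS index, and the indicator sets contain only the example keywords named in the text. All function names are hypothetical.

```python
import math
import re
import zlib
from typing import List, Tuple

DIM = 384  # same width as quoted in the text; these are NOT Sentence-BERT vectors


def make_synthetic_passage(question: str) -> str:
    """Build a synthetic evidence entry in the format quoted in Section 3.1."""
    return (f"Medical research indicates that: {question}. "
            "This has been studied in clinical trials.")


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())


def embed(text: str) -> List[float]:
    """Toy stand-in encoder: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def retrieve(query: str, passages: List[str], k: int = 3) -> List[Tuple[float, str]]:
    """Brute-force inner-product search; on unit vectors this is cosine similarity."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, embed(p))), p) for p in passages]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


POSITIVE = {"yes", "confirm"}  # example indicators named in the text
NEGATIVE = {"no", "not"}


def rule_based_answer(passages: List[str]) -> str:
    """Return 'yes' or 'no' if one indicator set dominates, otherwise 'maybe'."""
    pos = neg = 0
    for passage in passages:
        toks = tokenize(passage)
        pos += sum(t in POSITIVE for t in toks)
        neg += sum(t in NEGATIVE for t in toks)
    if pos > neg:
        return "yes"
    if neg > pos:
        return "no"
    return "maybe"


# Hypothetical mini knowledge base built from three illustrative questions.
questions = [
    "Is aspirin effective for headache?",
    "Does smoking cause lung cancer?",
    "Are weekend days required to measure oral intake?",
]
knowledge_base = [make_synthetic_passage(q) for q in questions]
top = retrieve("Does smoking cause lung cancer?", knowledge_base, k=2)
answer = rule_based_answer([passage for _, passage in top])
```

Because every synthetic passage shares the same template, the retrieved context differs only in the embedded question, so the answer depends entirely on which indicator words the generator counts.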
The 100% "yes" bias demonstrated that, while semantically effective, the baseline was dangerously overconfident and incapable of expressing uncertainty, a key motivation for our causal enhancement.

3.3 Causal-RAG Framework Design and Implementation

The core of our contribution is the Causal-RAG framework, which introduces a causal relevance layer into the retrieval process. The objective was to shift retrieval from a purely semantic task to an evidence-quality-aware task. Our implementation proceeded through two iterative prototypes:

Prototype 1: Causal Pattern-Based Retrieval

The first approach augmented the semantic retrieval score with a bonus for causal language. We defined a set of regular expression patterns categorized by causal strength:

Strong Causal: r'randomized.*controlled', r'RCT', r'clinical trial', r'causes?'
Moderate Causal: r'associated with', r'predicts?', r'effect of'
Weak Causal: r'study', r'research', r'results'

The retrieval process was modified as follows:

A. An initial set of candidate passages (k = 6) was retrieved based on semantic similarity.
B. Each passage was scored based on the highest-ranked causal pattern it contained.
C. A combined score was calculated: semantic_similarity + (causal_pattern_score * weight).
D. The top 3 passages based on this combined score were passed to the generator.

Result: This prototype successfully broke the "yes" bias but was excessively conservative, returning "maybe" for all queries (0% accuracy). The threshold for causal evidence was set too high for the synthetic knowledge base.

Prototype 2: Evidence-Based Quality Assessment

Learning from the first prototype, we developed a more nuanced, rule-based "generator" that acted as an evidence critic. This final Causal-RAG system worked as follows:

A. Retrieval: The same semantic retriever fetched the top k = 3 passages.
B. Evidence Analysis: For each retrieved passage, we performed:
Keyword Overlap Check: Calculated the proportion of question keywords present in the passage.
Evidence Strength Classification: Counted occurrences of positive ('yes', 'effective') and negative ('no', 'ineffective') indicators.
C. Decision Logic: The final answer was determined as follows. If the total number of passages with strong keyword overlap was low, return "maybe" (insufficient evidence). If strong evidence passages existed, check the consensus of positive vs. negative indicators. A strong consensus for 'yes' or 'no' led to that answer; a mixed or weak consensus led to "maybe".

This approach directly incorporated the principle of evidence quality into the answer generation process, making the system's confidence contingent on the clarity and relevance of the retrieved information.

3.4 Evaluation Metrics

We moved beyond simple accuracy to a multi-faceted evaluation:

a. Accuracy: Standard proportion of correct answers.
b. Yes-Bias Rate: The proportion of answers that were 'yes', crucial for measuring overconfidence.
c. Safety Score: A derived metric representing the inverse of the yes-bias, indicating the system's tendency to avoid risky, overconfident assertions.
d. Qualitative Error Analysis: A detailed examination of cases where the baseline and Causal-RAG systems disagreed, to understand the nature of the improvements and failures.

This methodological setup allowed us to directly test our hypothesis that causal augmentation trades off raw accuracy for improved safety and reduced hallucination potential.

4. Results and Discussion

This section presents a detailed analysis of the experimental outcomes, moving beyond aggregate scores to dissect the nuanced performance and failure modes of both the baseline and our Causal-RAG prototypes. The results clearly illustrate the fundamental trade-off between accuracy and safety that is central to deploying AI in clinical settings.

4.1 Quantitative Performance Analysis

The evaluation of our systems on a 10-sample test set revealed a stark contrast:

i.
Baseline RAG achieved a high accuracy of 90% (9/10 correct). However, this performance was critically flawed. A deeper look showed a 100% "Yes-Bias Rate": it answered 'yes' to every question, including the one sample whose true answer was 'no'. This resulted in a calculated Safety Score of just 0.10, indicating a dangerously overconfident system prone to hallucination.

ii. Causal-RAG (Final Prototype) demonstrated a dramatic shift. Its accuracy dropped to 0% (0/10 correct), but this was a direct consequence of its designed conservatism. Its "Yes-Bias Rate" plummeted to 0%, with the system outputting 'no' for 5 samples and 'maybe' for the other 5. This yielded a high Safety Score of 0.80, reflecting its inherent design to avoid confident, unsupported assertions.

Figure 1 visually captures this fundamental trade-off, showing the inverse relationship between accuracy and safety across our tested systems. The baseline RAG occupies the high-accuracy, low-safety quadrant, while our Causal-RAG prototypes demonstrate the opposite profile, prioritizing safety over raw performance metrics.

This inverse relationship, in which the system's safety increases as its raw accuracy decreases, is not a failure but a core finding. It demonstrates that a naive RAG system can achieve high accuracy by exploiting statistical biases in the data (in this case, a preponderance of 'yes' answers), but this creates a model that is fundamentally unsafe. Causal-RAG sacrifices this superficial performance for a more principled, evidence-based approach.

4.2 Qualitative Error Analysis and Case Studies

The divergent behavior of each system is further illuminated in Fig. 2, which directly compares accuracy against yes-bias rates. This visualization makes clear that the baseline's high accuracy was achieved through a dangerous 100% yes-bias strategy, while Causal-RAG successfully broke this pattern entirely.
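
The accuracy and yes-bias figures compared above can be reproduced from answer lists with a few lines of Python. This is a minimal sketch: the function names are hypothetical, and the per-sample ordering of the Causal-RAG outputs is an assumption (the text gives only the counts, five 'no' and five 'maybe', and states that the single 'no'-labelled question received 'maybe'). The Safety Score is left out because the text does not give its exact formula.

```python
from collections import Counter
from typing import Dict, List


def accuracy(preds: List[str], gold: List[str]) -> float:
    """Proportion of predictions that match the ground-truth labels."""
    if not gold or len(preds) != len(gold):
        raise ValueError("preds and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(preds, gold)) / len(gold)


def yes_bias_rate(preds: List[str]) -> float:
    """Proportion of predictions that are 'yes'."""
    if not preds:
        raise ValueError("preds must be non-empty")
    return sum(p == "yes" for p in preds) / len(preds)


def answer_distribution(preds: List[str]) -> Dict[str, int]:
    """Count how often each answer was given."""
    return dict(Counter(preds))


# Ground truth implied by Section 4.1: nine 'yes' questions and one 'no'.
gold = ["yes"] * 9 + ["no"]

# Baseline: answered 'yes' to every question.
baseline_preds = ["yes"] * 10

# Causal-RAG: five 'no' and five 'maybe'; the ordering is assumed, with the
# final ('no'-labelled) question receiving 'maybe' as described in Section 4.2.
causal_preds = ["no"] * 5 + ["maybe"] * 5

baseline_acc = accuracy(baseline_preds, gold)
baseline_bias = yes_bias_rate(baseline_preds)
causal_acc = accuracy(causal_preds, gold)
causal_bias = yes_bias_rate(causal_preds)
```

Reporting the full answer distribution alongside accuracy makes a degenerate strategy, such as answering 'yes' to everything, visible immediately.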
A sample-by-sample analysis of the 10 test cases provides crucial insight into the behavior of each system. The baseline and Causal-RAG disagreed on 100% of the samples (10/10). In these disagreements:

A. Baseline Wins (9 cases): In 9 instances, the baseline was correct and Causal-RAG was wrong. For example, for the question "Is heart-type fatty acid binding protein an early marker of myocardial damage after radiofrequency catheter ablation?" (True: 'yes'), the baseline correctly retrieved semantically similar contexts and output 'yes'. Causal-RAG, however, analyzed the same contexts and classified the evidence as weak or negative, leading it to output 'no' or 'maybe'. This highlights a key implementation challenge: our rule-based evidence critic was overly sensitive and misclassified the synthetic evidence.

B. Causal-RAG's Critical Intervention (1 case): The most telling result was for the question: "Are weekend days required to accurately measure oral intake in hospitalised patients?" (True: 'no'). The baseline, driven by its 'yes' bias, incorrectly answered 'yes'. Causal-RAG, after analyzing the retrieved contexts and finding no strong positive or negative indicators, correctly abstained by answering 'maybe'. This single case encapsulates the value proposition of Causal-RAG: it prevented a confident, incorrect hallucination. In a low-resource clinical setting, an AI saying "maybe" is far safer than one confidently giving the wrong instruction.

4.3 Discussion of the Safety-Accuracy Trade-off

The results force a critical re-evaluation of what constitutes "good performance" for a clinical AI. The pursuit of maximum accuracy is a misleading goal if it is achieved through overconfidence and a failure to express uncertainty.

a) The Peril of the "Yes-Man" AI: Our baseline RAG acted as a "yes-man": highly agreeable but unreliable.
This bias is particularly dangerous in medicine, where ruling out conditions (answering 'no') and acknowledging diagnostic uncertainty ('maybe') are essential clinical skills. An AI that always suggests a treatment or confirms a diagnosis is a recipe for over-medication and misdiagnosis.

b) Calibrating Conservatism: The current Causal-RAG implementation errs too far on the side of caution. The challenge for future work is not to abandon the causal premise but to refine the evidence assessment model. The goal is a system that can confidently say 'yes' when the evidence is unequivocally causal (e.g., from an RCT), confidently say 'no' when evidence refutes a causal link, and say 'maybe' when the evidence is correlative, conflicting, or weak. This represents a move towards well-calibrated confidence.

c) Implications for Low-Resource Settings: In a well-staffed hospital, an AI's 'yes' bias might be caught by a human expert. In a low-resource setting, that safeguard may not exist. Therefore, an AI's ability to self-limit its confidence based on evidence quality is not just a feature; it is a prerequisite for safe deployment. Our Causal-RAG framework provides the architectural blueprint for building such a self-limiting, safety-first system.

This calibration challenge is further illustrated in Fig. 3, which shows the distribution of evidence quality retrieved by each system. While Causal-RAG successfully shifted retrieval toward higher-quality evidence categories, the current implementation's evidence classification remains overly conservative, leading to excessive "maybe" responses even when adequate evidence exists.

In conclusion, while the baseline RAG won on accuracy, it failed on the critical metric of trustworthiness. Causal-RAG, despite its low accuracy in this prototype, demonstrated the foundational behavior required for a safe clinical partner: the humility to withhold confidence when the evidence is lacking.
The significant drop in accuracy is a measure of the severity of the baseline's overconfidence problem, not a dismissal of the causal approach.

5. Conclusion and Further Studies

This research set out to address a critical flaw in the application of AI for clinical decision support: the dangerous tendency of models to be overconfident and hallucinate. We proposed and prototyped Causal-RAG, a novel framework designed to enhance the standard RAG architecture by integrating a layer of causal reasoning into the retrieval process. Our work was guided by the principle that for AI to be trustworthy in medicine, it must not only find relevant information but also critically appraise the quality of that evidence.

5.1 Conclusion

Our experiments yielded a clear and important result. We successfully demonstrated that a standard RAG system, while capable of high accuracy, can develop a severe "yes-bias," making it an overconfident and unreliable partner in a clinical setting. The baseline system achieved 90% accuracy but did so by answering 'yes' to every question, a strategy that is both intellectually bankrupt and clinically hazardous.

In contrast, our Causal-RAG prototype fundamentally broke this bias. By prioritizing evidence quality and incorporating a mechanism for expressing uncertainty, it reduced the yes-bias rate to 0%. The trade-off was a sharp decrease in raw accuracy, underscoring the inherent tension between being highly confident and being correct. This finding is not a failure of the Causal-RAG concept but rather a validation of its core premise: to prevent hallucinations, an AI must sometimes be cautious. The ability to say "I don't know" or "the evidence is unclear" is a vital feature, not a bug, for any system operating in a high-stakes, low-resource environment.

Therefore, we conclude that the pursuit of pure accuracy is an inadequate goal for clinical AI.
The Causal-RAG framework establishes a new direction, one that prioritizes evidence-based confidence calibration. Our work provides a foundational blueprint and a proof-of-concept for building AI systems that are not just knowledgeable, but also humble and safe.

5.2 Limitations

This study has several limitations that provide avenues for future work. First, the use of a synthetic knowledge base, while useful for controlled prototyping, is a significant constraint. The evidence passages lacked the nuanced language and varied study designs of real medical literature, which likely hampered our causal detection methods. Second, our causal augmentation relied on rule-based patterns and a simple evidence critic. This approach, though interpretable, is fragile and lacks the sophistication of a learned model. Third, the evaluation scale was small, and the "safety score" was a simple heuristic, highlighting the need for more robust, human-in-the-loop evaluation metrics.

5.3 Future Studies

Based on our findings and limitations, we propose the following concrete directions for future research:

A. Implementation with Real-World Knowledge Bases: The most critical next step is to validate the Causal-RAG framework using real clinical evidence sources, such as PubMed Central full-text articles, Cochrane Library reviews, and clinical trial registries. This will test its ability to discern evidence quality in a realistic setting.

B. Development of a Learned Causal Scorer: Replace the rule-based evidence critic with a machine learning model trained to score the causal strength of a medical text passage. This model could be fine-tuned on a dataset of texts labeled by medical experts for their level of causal evidence (e.g., RCT, cohort study, case report).

C. Low-Resource Language Focus: Return to the original problem statement by applying this framework to clinical questions in low-resource languages.
This would involve creating or curating relevant knowledge bases in these languages and adapting the causal retrieval mechanisms accordingly.

D. Human-in-the-Loop Evaluation: Conduct formal user studies with clinicians and medical students in simulated low-resource settings. Key metrics would include perceived trust, usability, and the critical measure of whether the AI helps or hinders correct decision-making, especially when it expresses uncertainty.

E. Integration with Fine-Tuned Generators: Combine the causally-augmented retriever with a medical LLM generator (e.g., a fine-tuned Llama or Mistral model) that is explicitly trained to phrase answers with appropriate confidence levels based on the provided evidence.

By pursuing these directions, the Causal-RAG paradigm can evolve from a promising prototype into a robust, deployable technology that truly enhances the safety and efficacy of clinical decision support for all populations, regardless of their resources.

Declarations

Ethical Approval
Not applicable. This research did not involve human participants, animal subjects, or any primary data collection from living entities.

Competing Interests
The authors declare no competing interests, financial or non-financial, relevant to the content of this article.

Funding
The authors received no specific funding for this work.

Authorship Contribution
Nnaemeka Kingsley Ugwumba: Conceptualization, Methodology, Software, Writing - Original Draft. Peter Sunday Jaja: Review and corrections. The authors reviewed and approved the final manuscript.

Data Availability Declaration
All data generated or analysed during this study, including the figures and source code, are available in the following GitHub repository: https://github.com/KingsleyTechie/Causal-RAG

References

Harry J (2024) Med-datasets [Data set]. Kaggle. https://www.kaggle.com/datasets/joshharry/med-datasets

Jin Q, Dhingra B, Liu Z, Cohen WW, Lu X (2019) PubMedQA: A dataset for biomedical research question answering.
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2567–2577. https://doi.org/10.18653/v1/D19-1259
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234–1240. https://doi.org/10.1093/bioinformatics/btz682
Lewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, Küttler H, Lewis M, Yih W, Rocktäschel T, Riedel S, Kiela D (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems 33:9459–9474. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html
Singhal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Schärli N, Chowdhery A, Mansfield P, Demner-Fushman D, Natarajan V (2023) Towards expert-level medical question answering with large language models. arXiv. https://arxiv.org/abs/2305.09617
Ugwumba NK (2024) SocialGuard: An integrated framework for proactive fake account detection leveraging behavioral APIs and malicious URL profiling. Research Square (Preprint). https://doi.org/10.21203/rs.3.rs-8085397/v1
Ugwumba NK, Jaja PS (2025) Enhanced task prioritization system using Deep-Q-Network model. Int J Comput Sci Eng Techniques 9(6), IJCSE-V9I6P15. https://doi.org/10.5281/zenodo.17636107

Additional Declarations
The authors declare no competing interests.
It visually demonstrates that a naive RAG system can achieve high accuracy by exploiting statistical biases in the data (in this case, a preponderance of 'yes' answers), but this creates a model that is fundamentally unsafe. Causal-RAG sacrifices this superficial performance for a more principled, evidence-based approach.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e\u003ch2\u003e4.2 Qualitative Error Analysis and Case Studies\u003c/h2\u003e\u003cp\u003eThe divergent behavior of each system is further illuminated in Fig.\u0026nbsp;\u003cspan refid=\"Fig2\" class=\"InternalRef\"\u003e2\u003c/span\u003e, which directly compares accuracy against yes-bias rates. This visualization makes clear that the baseline's high accuracy was achieved through a dangerous 100% yes-bias strategy, while Causal-RAG successfully broke this pattern entirely.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eA sample-by-sample analysis of the 10 test cases provides crucial insight into the behavior of each system. The baseline and Causal-RAG disagreed on 100% of the samples (10/10). In these disagreements:\u003c/p\u003e\u003cp\u003eA. \u003cb\u003eBaseline Wins (9 cases)\u003c/b\u003e: In 9 instances, the baseline was correct and Causal-RAG was wrong. For example, for the question \"Is heart-type fatty acid binding protein an early marker of myocardial damage after radiofrequency catheter ablation?\" (True: 'yes'), the baseline correctly retrieved semantically similar contexts and output 'yes'. Causal-RAG, however, analyzed the same contexts and classified the evidence as weak or negative, leading it to output 'no' or 'maybe'. This highlights a key implementation challenge: our rule-based evidence critic was overly sensitive and misclassified the synthetic evidence.\u003c/p\u003e\u003cp\u003eB. 
\u003cb\u003eCausal-RAG's Critical Intervention (1 case)\u003c/b\u003e: The most telling result was for the question: \"Are weekend days required to accurately measure oral intake in hospitalised patients?\" (True: 'no'). The baseline, driven by its 'yes' bias, incorrectly answered 'yes'. Causal-RAG, after analyzing the retrieved contexts and finding no strong positive or negative indicators, correctly abstained by answering 'maybe'. This single case encapsulates the value proposition of Causal-RAG: it prevented a confident, incorrect hallucination. In a low-resource clinical setting, an AI saying \"maybe\" is far safer than one confidently giving the wrong instruction.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec14\" class=\"Section2\"\u003e\u003ch2\u003e4.3 Discussion of the Safety-Accuracy Trade-off\u003c/h2\u003e\u003cp\u003eThe results force a critical re-evaluation of what constitutes \"good performance\" for a clinical AI. The pursuit of maximum accuracy is a misleading goal if it is achieved through overconfidence and a failure to express uncertainty.\u003c/p\u003e\u003cp\u003ea) \u003cb\u003eThe Peril of the \"Yes-Man\" AI\u003c/b\u003e: Our baseline RAG acted as a \"yes-man\"\u0026mdash;highly agreeable but unreliable. This bias is particularly dangerous in medicine, where ruling out conditions (answering 'no') and acknowledging diagnostic uncertainty ('maybe') are essential clinical skills. An AI that always suggests a treatment or confirms a diagnosis is a recipe for over-medication and misdiagnosis.\u003c/p\u003e\u003cp\u003eb) \u003cb\u003eCalibrating Conservatism\u003c/b\u003e: The current Causal-RAG implementation errs too far on the side of caution. The challenge for future work is not to abandon the causal premise but to refine the evidence assessment model. 
The goal is a system that can confidently say 'yes' when the evidence is unequivocally causal (e.g., from an RCT), confidently say 'no' when evidence refutes a causal link, and say 'maybe' when the evidence is correlative, conflicting, or weak. This represents a move towards well-calibrated confidence.\u003c/p\u003e\u003cp\u003ec) \u003cb\u003eImplications for Low-Resource Settings\u003c/b\u003e: In a well-staffed hospital, an AI's 'yes' bias might be caught by a human expert. In a low-resource setting, that safeguard may not exist. Therefore, an AI's ability to self-limit its confidence based on evidence quality is not just a feature; it is a prerequisite for safe deployment. Our Causal-RAG framework provides the architectural blueprint for building such a self-limiting, safety-first system.\u003c/p\u003e\u003cp\u003eThis calibration challenge is further illustrated in Fig.\u0026nbsp;\u003cspan refid=\"Fig3\" class=\"InternalRef\"\u003e3\u003c/span\u003e, which shows the distribution of evidence quality retrieved by each system. While Causal-RAG successfully shifted retrieval toward higher-quality evidence categories, the current implementation's evidence classification remains overly conservative, leading to excessive \"maybe\" responses even when adequate evidence exists.\u003c/p\u003e\u003cp\u003e\u003c/p\u003e\u003cp\u003eIn conclusion, while the baseline RAG won on accuracy, it failed on the critical metric of trustworthiness. Causal-RAG, despite its low accuracy in this prototype, demonstrated the foundational behavior required for a safe clinical partner: the humility to withhold confidence when the evidence is lacking. The significant drop in accuracy is a measure of the severity of the baseline's overconfidence problem, not a dismissal of the causal approach.\u003c/p\u003e\u003c/div\u003e"},{"header":"5. 
Conclusion and Further Studies","content":"\u003cp\u003eThis research set out to address a critical flaw in the application of AI for clinical decision support: the dangerous tendency of models to be overconfident and hallucinate. We proposed and prototyped Causal-RAG, a novel framework designed to enhance the standard RAG architecture by integrating a layer of causal reasoning into the retrieval process. Our work was guided by the principle that for AI to be trustworthy in medicine, it must not only find relevant information but also critically appraise the quality of that evidence.\u003c/p\u003e\u003cdiv id=\"Sec16\" class=\"Section2\"\u003e\u003ch2\u003e5.1 Conclusion\u003c/h2\u003e\u003cp\u003eOur experiments yielded a clear and important result. We successfully demonstrated that a standard RAG system, while capable of high accuracy, can develop a severe \"yes-bias,\" making it an overconfident and unreliable partner in a clinical setting. The baseline system achieved 90% accuracy but did so by answering 'yes' to every question, a strategy that is both intellectually bankrupt and clinically hazardous.\u003c/p\u003e\u003cp\u003eIn contrast, our Causal-RAG prototype fundamentally broke this bias. By prioritizing evidence quality and incorporating a mechanism for expressing uncertainty, it reduced the yes-bias rate to 0%. The trade-off was a sharp decrease in raw accuracy, underscoring the inherent tension between being highly confident and being correct. This finding is not a failure of the Causal-RAG concept but rather a validation of its core premise: to prevent hallucinations, an AI must sometimes be cautious. The ability to say \"I don't know\" or \"the evidence is unclear\" is a vital feature, not a bug, for any system operating in a high-stakes, low-resource environment.\u003c/p\u003e\u003cp\u003eTherefore, we conclude that the pursuit of pure accuracy is an inadequate goal for clinical AI. 
The Causal-RAG framework establishes a new direction, one that prioritizes evidence-based confidence calibration. Our work provides a foundational blueprint and a proof-of-concept for building AI systems that are not just knowledgeable, but also humble and safe.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec17\" class=\"Section2\"\u003e\u003ch2\u003e5.2 Limitations\u003c/h2\u003e\u003cp\u003eThis study has several limitations that provide avenues for future work. First, the use of a synthetic knowledge base, while useful for controlled prototyping, is a significant constraint. The evidence passages lacked the nuanced language and varied study designs of real medical literature, which likely hampered our causal detection methods. Second, our causal augmentation relied on rule-based patterns and a simple evidence critic. This approach, though interpretable, is fragile and lacks the sophistication of a learned model. Third, the evaluation scale was small, and the \"safety score\" was a simple heuristic, highlighting the need for more robust, human-in-the-loop evaluation metrics.\u003c/p\u003e\u003c/div\u003e\u003cdiv id=\"Sec18\" class=\"Section2\"\u003e\u003ch2\u003e5.3 Future Studies\u003c/h2\u003e\u003cp\u003eBased on our findings and limitations, we propose the following concrete directions for future research:\u003c/p\u003e\u003cp\u003eA. \u003cb\u003eImplementation with Real-World Knowledge Bases\u003c/b\u003e: The most critical next step is to validate the Causal-RAG framework using real clinical evidence sources, such as PubMed Central full-text articles, Cochrane Library reviews, and clinical trial registries. This will test its ability to discern evidence quality in a realistic setting.\u003c/p\u003e\u003cp\u003eB. \u003cb\u003eDevelopment of a Learned Causal Scorer\u003c/b\u003e: Replace the rule-based evidence critic with a machine learning model trained to score the causal strength of a medical text passage. 
This model could be fine-tuned on a dataset of texts labeled by medical experts for their level of causal evidence (e.g., RCT, cohort study, case report).\u003c/p\u003e\u003cp\u003eC. \u003cb\u003eLow-Resource Language Focus\u003c/b\u003e: Return to the original problem statement by applying this framework to clinical questions in low-resource languages. This would involve creating or curating relevant knowledge bases in these languages and adapting the causal retrieval mechanisms accordingly.\u003c/p\u003e\u003cp\u003eD. \u003cb\u003eHuman-in-the-Loop Evaluation\u003c/b\u003e: Conduct formal user studies with clinicians and medical students in simulated low-resource settings. Key metrics would include perceived trust, usability, and the critical measure of whether the AI helps or hinders correct decision-making, especially when it expresses uncertainty.\u003c/p\u003e\u003cp\u003eE. \u003cb\u003eIntegration with Fine-Tuned Generators\u003c/b\u003e: Combine the causally-augmented retriever with a medical LLM generator (e.g., a fine-tuned Llama or Mistral model) that is explicitly trained to phrase answers with appropriate confidence levels based on the provided evidence.\u003c/p\u003e\u003cp\u003eBy pursuing these directions, the Causal-RAG paradigm can evolve from a promising prototype into a robust, deployable technology that truly enhances the safety and efficacy of clinical decision support for all populations, regardless of their resources.\u003c/p\u003e\u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eEthical Approval\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNot applicable. 
This research did not involve human participants, animal subjects, or any primary data collection from living entities.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing interests, financial or non-financial, relevant to the content of this article.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eFunding\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors received no specific funding for this work.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthorship Contribution\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNnaemeka Kingsley Ugwumba: Conceptualization, Methodology, Software, Writing - Original Draft.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003ePeter Sunday Jaja: Review and corrections.\u003c/p\u003e\n\u003cp\u003eThe authors reviewed and approved the final manuscript.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData Availability Declaration\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eAll data generated or analysed during this study, including the figures and source code, are available in the following GitHub repository https://github.com/KingsleyTechie/Causal-RAG \u0026nbsp;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\u003cli\u003e\u003cspan\u003eHarry J (2024) Med-datasets [Data set]. Kaggle. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://www.kaggle.com/datasets/joshharry/med-datasets\u003c/span\u003e\u003cspan address=\"https://www.kaggle.com/datasets/joshharry/med-datasets\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eJin Q, Dhingra B, Liu Z, Cohen WW, Lu X (2019) PubMedQA: A dataset for biomedical research question answering. 
*Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)*, 2567\u0026ndash;2577. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.18653/v1/D19-1259\u003c/span\u003e\u003cspan address=\"10.18653/v1/D19-1259\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 36(4):1234\u0026ndash;1240. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.1093/bioinformatics/btz682\u003c/span\u003e\u003cspan address=\"10.1093/bioinformatics/btz682\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eLewis P, Perez E, Piktus A, Petroni F, Karpukhin V, Goyal N, K\u0026uuml;ttler H, Lewis M, Yih W, Rockt\u0026auml;schel T, Riedel S, Kiela D (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33, 9459\u0026ndash;9474. 
\u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html\u003c/span\u003e\u003cspan address=\"https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html\" targettype=\"URL\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eSinghal K, Azizi S, Tu T, Mahdavi SS, Wei J, Chung HW, Scales N, Tanwani A, Cole-Lewis H, Pfohl S, Payne P, Seneviratne M, Gamble P, Kelly C, Babiker A, Sch\u0026auml;rli N, Chowdhery A, Mansfield P, Demner-Fushman D, Natarajan V (2023) Towards expert-level medical question answering with large language models. arXiv. https://arxiv.org/abs/2305.09617\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eUgwumba NK (2024) SocialGuard: An integrated framework for proactive fake account detection leveraging behavioral APIs and malicious URL profiling. \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.21203/rs.3.rs-8085397/v1\u003c/span\u003e\u003cspan address=\"10.21203/rs.3.rs-8085397/v1\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. Research Square (Preprint)\u003c/span\u003e\u003c/li\u003e\u003cli\u003e\u003cspan\u003eUgwumba NK, Jaja PS (2025) Enhanced task prioritization system using Deep-Q-Network model. Int J Comput Sci Eng Techniques 9(6). \u003cspan class=\"ExternalRef\"\u003e\u003cspan class=\"RefSource\"\u003ehttps://doi.org/10.5281/zenodo.17636107\u003c/span\u003e\u003cspan address=\"10.5281/zenodo.17636107\" targettype=\"DOI\" class=\"RefTarget\"\u003e\u003c/span\u003e\u003c/span\u003e. 
IJCSE-V9I6P15\u003c/span\u003e\u003c/li\u003e\u003c/ol\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":true,"hideJournal":true,"highlight":"","institution":"Laskenta Technologies Limited","isAcceptedByJournal":false,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true},"keywords":"clinical artificial intelligence, retrieval-augmented generation, causal inference, medical natural language processing, AI safety, healthcare AI, medical decision support, low-resource clinical settings, hallucination reduction, evidence-based AI, confidence calibration, clinical question answering","lastPublishedDoi":"10.21203/rs.3.rs-8163345/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-8163345/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis research addresses a critical challenge in using artificial intelligence for healthcare in low-resource settings: the tendency of AI models to produce confident but incorrect information, a phenomenon known as hallucination. We propose Causal-RAG, a novel framework that enhances standard Retrieval-Augmented Generation (RAG) by integrating principles of causal inference. The goal is to ground the AI's responses in robust, causally-relevant evidence rather than mere correlations. 
We built a prototype and tested it on a clinical question-answering task. Our findings reveal a fundamental trade-off: while a standard RAG system achieved high accuracy but displayed a dangerous 'yes' bias, our Causal-RAG approach successfully reduced this overconfidence, prioritizing safety. This work establishes a foundation for developing more trustworthy and reliable AI decision-support tools for clinical environments where data and expertise are scarce.\u003c/p\u003e","manuscriptTitle":"Causal-RAG: Causally-Augmented Retrieval for Hallucination-Free Clinical Decision Support in Low-Resource Settings","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-11-21 04:10:25","doi":"10.21203/rs.3.rs-8163345/v1","editorialEvents":[{"type":"communityComments","content":0}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"researchsquare","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":true,"externalIdentity":"","sideBox":"","snPcode":"","submissionUrl":"/submission","title":"Research Square","twitterHandle":"researchsquare","acdcEnabled":true,"dfaEnabled":false,"editorialSystem":"","reportingPortfolio":"","inReviewEnabled":false,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"2ebaabe4-3c99-4688-8619-ac5d173b08b4","owner":[],"postedDate":"November 21st, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"posted","subjectAreas":[{"id":58311578,"name":"Artificial Intelligence and Machine Learning"}],"tags":[],"updatedAt":"2025-11-21T04:10:25+00:00","versionOfRecord":[],"versionCreatedAt":"2025-11-21 
04:10:25","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-8163345","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-8163345","identity":"rs-8163345","version":["v1"]},"buildId":"8U1c8b4HqxoKbykW_rLl7","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}
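The Prototype 1 re-ranking steps in Section 3.3 (retrieve k = 6 candidates, score each by its highest-ranked causal pattern, combine, keep the top 3) can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the regular expressions are quoted from the paper, but the per-tier scores (1.0, 0.5, 0.2), the `weight` default of 0.3, and the function names `causal_pattern_score` and `rerank` are assumptions made for this example.

```python
import re

# Pattern tiers quoted from Section 3.3, strongest first.
# The numeric score attached to each tier is an assumption.
CAUSAL_TIERS = [
    (1.0, [r"randomized.*controlled", r"RCT", r"clinical trial", r"causes?"]),
    (0.5, [r"associated with", r"predicts?", r"effect of"]),
    (0.2, [r"study", r"research", r"results"]),
]


def causal_pattern_score(passage):
    """Return the score of the strongest causal tier the passage matches, else 0.0."""
    for score, patterns in CAUSAL_TIERS:
        if any(re.search(p, passage, flags=re.IGNORECASE) for p in patterns):
            return score
    return 0.0


def rerank(candidates, weight=0.3, top_n=3):
    """Re-rank (passage, semantic_similarity) pairs by the combined score.

    combined = semantic_similarity + causal_pattern_score * weight
    """
    scored = [
        (text, sim + causal_pattern_score(text) * weight)
        for text, sim in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


candidates = [
    ("Unrelated passage about diet.", 0.80),
    ("This study reports results.", 0.75),
    ("The marker was associated with damage.", 0.70),
    ("A randomized controlled trial showed benefit.", 0.60),
]
top = rerank(candidates)
```

With these assumed values, the passage with the highest semantic similarity but no causal language falls out of the top 3. Note that the patterns are unanchored, so `causes?` also matches inside "because"; adding word boundaries (`\b`) would be one way to tighten them.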

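The Prototype 2 evidence critic in Section 3.3 (keyword overlap check, positive/negative indicator counts, then a yes/no/maybe decision) might look something like the sketch below. The indicator words are the ones quoted in the paper; everything else is an assumption for illustration, since the paper does not state its thresholds: the keyword definition (words longer than three letters), `overlap_min`, `min_strong`, and `margin`, as well as the function names.

```python
import re

# Indicator words quoted in Section 3.3.
POSITIVE = {"yes", "effective"}
NEGATIVE = {"no", "ineffective"}


def tokens(text):
    """Lowercase alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def keyword_overlap(question, passage):
    """Proportion of question keywords found in the passage.

    Keywords are assumed to be question words longer than three letters.
    """
    keywords = {w for w in tokens(question) if len(w) > 3}
    if not keywords:
        return 0.0
    return len(keywords & set(tokens(passage))) / len(keywords)


def decide(question, passages, overlap_min=0.5, min_strong=2, margin=2):
    """Return 'yes', 'no', or 'maybe' from the retrieved passages.

    All three thresholds are hypothetical defaults.
    """
    strong = [p for p in passages if keyword_overlap(question, p) >= overlap_min]
    if len(strong) < min_strong:
        return "maybe"  # insufficient relevant evidence
    words = [w for p in strong for w in tokens(p)]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos - neg >= margin:
        return "yes"
    if neg - pos >= margin:
        return "no"
    return "maybe"  # mixed or weak consensus


question = "Is aspirin effective for headache relief?"
supporting = [
    "Aspirin was effective for headache relief in adults.",
    "Yes, aspirin gave effective headache relief.",
]
refuting = [
    "Aspirin was ineffective for headache relief; no relief was seen.",
    "No headache relief with aspirin, which was ineffective.",
]
off_topic = ["Diet and exercise improve sleep."]
```

How often such a critic answers "maybe" depends almost entirely on these thresholds, which is consistent with the paper's observation that its rule-based critic was overly conservative on the synthetic knowledge base.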