Epistemic Field Theory: Predicting and Governing Hallucination in Large Language Models via Multi-Model Consensus

preprint OA: closed CC-BY-4.0
Research Square | Article

AZRIL Hamzah, Shasha Teng

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-8929856/v1
This work is licensed under a CC BY 4.0 License.
Status: Under Review. Version 1 posted.

Abstract

Large language models hallucinate at rates that undermine reliability in high-stakes applications. Mitigation strategies based on repeated sampling or majority voting implicitly assume error independence across samples. We introduce Epistemic Field Theory (EFT), which predicts hallucination probability from multi-model consensus. EFT defines a consensus field σ ∈ [0, 1] over query space and derives the predictor P(H) = (1 − σ)·η, where η is a model-specific noise coefficient.
Across 13,728 labelled responses (500-response human validation sample, Cohen’s κ = 0.87) from four frontier models in three professional domains, σ predicts hallucination with AUC = 0.787, outperforming majority voting (0.518), SelfCheck methods (0.358–0.377), and self-reported confidence (0.461). Hallucination counts exhibit systematic overdispersion (ρ = 1.50), with empirical majority-failure rates 2.96× higher than independence predicts. Epistemic grounding reduces hallucination rates but not error correlation, revealing frequency and structure as independent dimensions of the hallucination problem.

Subjects: Scientific community and society/Social sciences/Ethics; Physical sciences/Mathematics and computing/Computational science; Physical sciences/Mathematics and computing/Statistics; Health sciences/Health care/Diagnosis

Introduction

Generative language models produce outputs that are linguistically fluent but epistemically unstable. The same model, given the same query, may produce correct answers, confident fabrications, or hedged speculation, all rendered with comparable surface fluency. This instability creates a fundamental problem for automated systems: distinguishing reliable outputs from hallucinations requires external verification that the system cannot itself provide. Hallucination has been documented across question answering [13], summarisation [15], and dialogue [14], with causes traced to exposure bias, likelihood maximisation objectives, and training data sparsity [11].

The scale of this problem continues to grow. Sakai et al. [1] found that 3.7% of papers at EMNLP 2025 contained citations to non-existent publications, a 13× increase over the previous year. In clinical deployment, Gallifant et al. [2] report that over 80% of effort is consumed by validation infrastructure, drift management, and governance rather than model development, and that validation must function as a perpetual service because monthly API updates continuously alter model behaviour.

A substantial body of work has addressed hallucination through uncertainty estimation and detection. Bayesian deep learning approaches use dropout-based approximations [9] or deep ensembles [10] to estimate epistemic uncertainty from token-level probabilities. More recently, Farquhar et al. [8] introduced semantic entropy, which detects hallucinations by measuring meaning-level consistency across sampled outputs from a single model, achieving strong results for open-ended generation tasks. SelfCheckGPT [6] takes a related approach, sampling multiple responses from one model and measuring consistency via Jaccard similarity or embedding distance. Model self-reported confidence [4,5] elicits uncertainty through prompting or calibration. However, all these methods operate within a single model and therefore inherit that model’s systematic biases, a limitation that becomes critical when errors are correlated across samples.

A parallel line of work leverages agreement across multiple models or agents. Self-consistency voting [3] improves chain-of-thought reasoning by sampling multiple reasoning paths and selecting the most common answer via majority voting. Multi-LLM collaboration frameworks [16] identify knowledge gaps through cross-model disagreement. Deep ensembles [10] improve robustness through agreement among independently trained instances. These approaches share a common structural assumption: that hallucination events arise as independent stochastic errors, such that majority voting over n samples reduces failure probability according to binomial expectations. Despite widespread deployment, this independence assumption has not been empirically validated at scale across diverse models and domains.
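To make the independence assumption concrete, the following minimal Python sketch computes the probability that a strict majority of n answers is wrong when each answer fails independently with probability p. The function name `majority_failure_prob` and the value p = 0.3 are illustrative assumptions, not quantities taken from the study.

```python
from math import comb


def majority_failure_prob(p: float, n: int) -> float:
    """Probability that a strict majority of n answers is wrong,
    assuming each answer is wrong independently with probability p."""
    k_min = n // 2 + 1  # smallest count of wrong answers forming a strict majority
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    p = 0.3  # illustrative per-answer error rate (an assumption, not a study value)
    for n in (1, 3, 5, 9):
        print(f"n={n}: P(majority wrong) = {majority_failure_prob(p, n):.4f}")
```

Under independence and p < 0.5, the failure probability shrinks as n grows over odd ensemble sizes. This is the binomial expectation that correlated errors would undermine.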
The existing landscape thus operates at two levels that leave a critical gap. Model-level interventions (retrieval augmentation [12], constrained decoding, fine-tuning) aim to reduce hallucination frequency. Output-level detection (self-consistency checking, fact verification, uncertainty estimation) aims to flag problematic responses post-generation. Neither provides a principled framework for predicting hallucination risk from observable signals before downstream decisions are made, nor does either address what happens when the errors that detection methods rely on are themselves correlated.

We propose Epistemic Field Theory (EFT), which addresses this gap by formalising multi-model consensus as a predictive signal for hallucination. The core insight is that when independent models agree on a response, that response is more likely to be correct, and the degree of semantic agreement across the full response space carries richer information than binary answer-level voting. EFT formalises this through three components: a consensus field σ : Q → [0, 1] measuring pairwise semantic similarity across an ensemble; a noise coefficient η ∈ [0, 1] capturing each model’s intrinsic hallucination rate in low-consensus regions; and a hallucination predictor P(H | q, M) = (1 − σ(q))·η. Where semantic entropy [8] measures consistency within a single model’s output distribution, σ measures consistency across models with different training backgrounds, accessing query-level epistemic properties that no single model can observe. Formally, majority voting is a special case of the consensus field in which the similarity function is reduced to exact string match and the continuous field value is collapsed to a binary threshold (see Supplementary Note 1 for formal definitions and theoretical analysis).

Here we validate EFT across 13,728 labelled responses from four frontier models in medical, legal, and technical domains.
We report four principal findings: (1) the consensus field predicts hallucination with AUC = 0.787, substantially outperforming all baselines including semantic entropy’s single-model approach (Fig. 2); (2) hallucination errors exhibit systematic overdispersion, with majority-failure rates 2.96× higher than independence predicts, directly challenging the foundation of self-consistency methods (Fig. 3a); (3) epistemic grounding reduces hallucination rates but not error correlation, revealing two independent dimensions of the hallucination problem (Fig. 3b); and (4) model self-reported confidence is entirely uninformative, with all four frontier models reporting maximum confidence on 100% of responses including hallucinations.

Results

Dataset and validation

We collected 13,728 individual model responses across 298 unique topics, organised into 3,432 query groups (each comprising one query answered by all four models). Domain breakdown: Medical (4,672 responses), Legal (4,640), Technical (4,416). Overall: 67.1% correct, 29.4% hallucination, 2.7% abstention. A sample of 500 responses was independently human-validated, achieving Cohen’s κ = 0.87 (see Methods for full experimental design).

Consensus field predicts hallucination

The EFT prediction that σ correlates negatively with hallucination is confirmed: Pearson r = −0.36, p < 0.001; Spearman ρ = −0.36, p < 0.001 (χ² = 1762.5, df = 4, Cramér’s V = 0.36). Figure 1a shows hallucination rates stratified by consensus band. The monotonic decrease from 50.1% at low σ to 5.8% at σ = 1.0 (odds ratio 16.3) confirms the core EFT prediction. The continuous relationship between σ and hallucination rate (Fig. 1b) demonstrates that the consensus field provides a graded risk signal rather than a binary classification. The residual 5.8% at σ = 1.0 defines the shared blindness floor: the irreducible failure rate when all models share the same misconception due to correlated training data (see Supplementary Note 5 for worked examples).
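As a quick arithmetic check, the odds ratios reported for the consensus bands can be recomputed from the band-level hallucination rates alone. The sketch below is illustrative: the helper names `odds` and `odds_ratio` are assumptions, and the rates are the rounded percentages reported for each band with σ = 1.0 as the reference.

```python
def odds(p: float) -> float:
    """Convert a probability into odds."""
    return p / (1.0 - p)


def odds_ratio(p_band: float, p_ref: float) -> float:
    """Odds of hallucination in a band relative to the reference band."""
    return odds(p_band) / odds(p_ref)


P_REF = 0.058  # hallucination rate at sigma = 1.0 (reference band)

# Rounded hallucination rates for the five lower consensus bands.
band_rates = {
    "[0.0, 0.2)": 0.501,
    "[0.2, 0.4)": 0.389,
    "[0.4, 0.6)": 0.271,
    "[0.6, 0.8)": 0.244,
    "[0.8, 1.0)": 0.199,
}

for band, rate in band_rates.items():
    print(f"{band}: OR = {odds_ratio(rate, P_REF):.1f}")
```

Because the inputs are rounded percentages, agreement with the published odds ratios to one decimal place is the most this check can show.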
Table 1 | Hallucination rate by consensus band. N = 13,728 responses across 3,432 query groups.

σ range      Hallucination rate   Responses (groups)   OR vs σ = 1.0
[0.0, 0.2)   50.1%                3,728 (932)          16.3
[0.2, 0.4)   38.9%                2,172 (543)          10.3
[0.4, 0.6)   27.1%                2,580 (645)          6.0
[0.6, 0.8)   24.4%                1,196 (299)          5.2
[0.8, 1.0)   19.9%                692 (173)            4.0
σ = 1.0      5.8%                 3,360 (840)          1.0 (ref)

Model-specific noise coefficients

Noise coefficients η vary across models: Claude-Sonnet (0.342), DeepSeek (0.399), Gemini-Pro (0.533), GPT-4o (0.565). Domain-specific coefficients reveal consistent model ordering across medical, legal, and technical domains (Supplementary Table 1), suggesting η captures a stable model-level property rather than a domain-specific artifact.

Error correlation falsifies independence

Joint hallucination rates between model pairs are 2.0–2.7× higher than expected under independence (Table 2). Phi coefficients range from 0.49 to 0.55, indicating substantial positive error correlation across all six model pairs. Formally, for each query group we count hallucinating models H_q out of n = 4. Under independence, H_q ~ Binomial(n, p). The overdispersion coefficient ρ = 1.50 substantially exceeds zero, confirming statistical dependence (Fig. 3a). The empirical majority-failure rate is 20.2%, versus 6.8% predicted under independence, a 2.96× discrepancy. Self-consistency methods systematically overestimate their reliability gains by this factor.

Table 2 | Error correlation between model pairs.

Model pair           φ       P(both H)   P(H₁)·P(H₂)   Ratio
Claude × GPT-4o      0.497   0.180       0.081         2.24
Claude × Gemini      0.502   0.174       0.075         2.32
Claude × DeepSeek    0.549   0.159       0.058         2.72
GPT-4o × Gemini      0.545   0.243       0.120         2.03
GPT-4o × DeepSeek    0.493   0.197       0.093         2.11
Gemini × DeepSeek    0.522   0.195       0.087         2.24

Comparison with existing detection methods

Figure 2 compares σ against five baselines on a medical-domain subsample (N = 1,000). Three findings merit emphasis.
First, self-reported model confidence is entirely uninformative: all four models report maximum confidence (1.0) on 100% of responses, including hallucinations (AUC = 0.461, below chance). Second, SelfCheck methods (Jaccard AUC = 0.358, Embedding AUC = 0.377) perform below random despite requiring substantially more compute, consistent with the correlated-error finding: SelfCheck operates within a single model, sampling from the same biased distribution. Third, the comparison between σ and majority voting is particularly informative because both use identical model responses; the AUC gap (0.787 vs 0.518) arises entirely from the information extraction method. The near-zero correlation between σ and SelfCheck scores (r = 0.02–0.14) confirms these methods capture orthogonal information dimensions.

Table 3 | Hallucination detection method comparison (N = 1,000, medical domain).

Method                        AUC-ROC   AUC-PR   API calls
Random                        0.528     0.227    0
Self-reported confidence      0.461     0.166    0
Majority vote (4 models)      0.518     0.235    4
SelfCheck-Jaccard (N = 5)     0.358     0.143    5/model
SelfCheck-Embedding (N = 5)   0.377     0.163    5/model + embed
EFT σ (4 models)              0.787     0.417    4

Grounding reduces rates but not correlation

The hallucination base rate decreases monotonically with grounding (Unanchored 30.1% → Anchored 27.1% → EBP 26.3%; McNemar χ² = 55.9, p < 0.001). However, overdispersion remains at ρ ≈ 1.5 across all conditions (φ = 0.505–0.529; Fig. 3b). This dissociation reveals two independent dimensions of the hallucination problem: frequency (how often models hallucinate) and structure (the pattern of co-failure). Grounding interventions address frequency; σ-based gating addresses structure. Neither alone is sufficient.

Decision gating analysis

Consensus-gated commitment enables threshold-based automation with quantifiable coverage–reliability trade-offs (Fig. 4). At σ ≥ 0.6, hallucination rate drops to 11.9% (59.5% reduction) while retaining 38.2% coverage.
At σ = 1.0, the rate reaches 5.8% (the shared blindness floor) with 24.5% coverage. This enables tiered deployment: high-σ responses proceed to confident automation; intermediate-σ responses enter cautious automation with monitoring; low-σ responses are routed to human review, reducing review burden by up to 72.8%.

Discussion

These results establish hallucination as a structured, predictable phenomenon conditioned on epistemic context rather than independent sampling noise.

Theoretical contributions

This work makes three theoretical contributions to the understanding of LLM reliability. First, we provide the first large-scale empirical falsification of the error independence assumption underlying self-consistency methods [3]. The finding that multi-model errors are systematically correlated (ρ = 1.50, φ = 0.49–0.55) directly challenges the theoretical foundation of approaches that assume superlinear reliability gains from repeated sampling. Any reliability analysis based on self-consistency or majority voting overestimates its reliability by approximately 3×, a quantitative correction factor that should inform future theoretical work on ensemble reliability. Second, we establish that hallucination has two independent dimensions, frequency and correlation structure, which respond to different interventions. Grounding reduces hallucination rates (30.1% → 26.3%) but leaves the overdispersion coefficient unchanged (ρ ≈ 1.5 across all conditions). This dissociation has not been previously demonstrated and suggests that the correlation structure arises from shared training data rather than from the query’s epistemic properties per se. Third, we formalise the consensus field σ as a continuous generalisation of majority voting that preserves semantic richness.
The information-theoretic gap between σ (AUC = 0.787) and majority voting (AUC = 0.518) on identical model responses demonstrates that the shift from binary answer-level agreement to continuous response-level semantic similarity captures substantially more predictive information about hallucination risk, at zero additional computational cost.

Practical contributions

EFT provides several directly deployable capabilities for AI systems operating in high-stakes environments. The consensus field σ serves as a real-time reliability signal that requires no model retraining, no access to internal model states, and operates with standard commercial API calls. This makes it immediately applicable in production settings where model internals are unavailable, which is the predominant deployment scenario for clinical, legal, and enterprise applications.

The decision gating framework (Fig. 4) translates σ into actionable deployment policies. At σ ≥ 0.8, hallucination rates drop to 8.2% with 29.5% coverage, enabling confident automation for nearly a third of queries. At σ < 0.2, the 50.1% hallucination rate justifies mandatory human review. The continuous nature of σ allows practitioners to set domain-appropriate thresholds: a clinical application might require σ ≥ 0.8 for autonomous operation, while a customer service application might accept σ ≥ 0.4. This threshold-based approach directly addresses the perpetual validation requirement described by Gallifant et al. [2]: σ provides a continuous monitoring signal that automatically detects when a model update causes disagreement with the ensemble, enabling drift detection without manual review cycles.

The noise coefficient η provides a model-level reliability metric that is stable across domains (Supplementary Table 1). If validated at scale, η could serve as a standardised reliability coefficient for model procurement and regulatory compliance, analogous to established safety metrics in other engineering domains.
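The coverage and pooled hallucination rate quoted for each gating threshold can be recomputed from the band counts in Table 1. The sketch below is a minimal illustration rather than the authors' analysis code: the names `BANDS`, `gate`, and `zone` are hypothetical, thresholds are only meaningful at band edges, and the handling of rates exactly at 10% or 20% is an assumption because the Figure 4 zone definitions leave those boundaries open.

```python
# (band lower bound, hallucination rate, responses), copied from Table 1
BANDS = [
    (0.0, 0.501, 3728),
    (0.2, 0.389, 2172),
    (0.4, 0.271, 2580),
    (0.6, 0.244, 1196),
    (0.8, 0.199, 692),
    (1.0, 0.058, 3360),
]


def gate(theta: float) -> tuple[float, float]:
    """Return (coverage, pooled hallucination rate) when keeping bands with sigma >= theta."""
    total = sum(n for _, _, n in BANDS)
    kept = [(rate, n) for lower, rate, n in BANDS if lower >= theta]
    n_kept = sum(n for _, n in kept)
    pooled_rate = sum(rate * n for rate, n in kept) / n_kept
    return n_kept / total, pooled_rate


def zone(rate: float) -> str:
    """Map a pooled hallucination rate to a deployment zone (boundary handling assumed)."""
    if rate < 0.10:
        return "confident automation"
    if rate <= 0.20:
        return "cautious automation"
    return "human review"


for theta in (0.0, 0.6, 0.8, 1.0):
    coverage, rate = gate(theta)
    print(f"sigma >= {theta}: coverage {coverage:.1%}, rate {rate:.1%}, {zone(rate)}")
```

Pooling the rounded band rates by response count is a simplification; exact per-response labels would be needed to reproduce the figures beyond one decimal place.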
The shared blindness floor of 5.8% at σ = 1.0 quantifies the irreducible limitation of consensus-based methods, providing practitioners with a precise scope boundary. This measurement enables informed architectural decisions: systems requiring error rates below 5.8% must incorporate retrieval augmentation, authoritative knowledge bases, or human expert review in addition to consensus-based gating.

Relationship to existing approaches

The consensus field σ outperforms existing detection methods because it measures agreement across models, accessing query-level epistemic properties that no single model can observe. Self-reported confidence is uninformative because current frontier models lack calibrated epistemic self-assessment when confidence is requested through prompting [4,5]. SelfCheck methods [6] sample from within a single model and inherit its systematic biases. Semantic entropy [8] detects hallucinations through meaning-level consistency within a single model; σ extends this principle to the cross-model setting, where the diversity of training backgrounds provides a stronger signal. The comparison with majority voting is especially informative: both methods consume identical model responses, yet σ achieves AUC = 0.787 versus 0.518, with the gap arising entirely from the shift from binary answer-level agreement to continuous semantic similarity across full responses.

Limitations

Several limitations warrant emphasis. We tested three professional domains; generalisation to creative, mathematical, or multilingual tasks is unvalidated. The informal theoretical bound (Supplementary Note 2) is looser than the idealised case given observed error correlations (φ = 0.49–0.55); deriving tighter bounds under explicit models of partial dependence is an important direction for future theoretical work. Noise coefficients will shift as models are updated, requiring periodic recalibration.
The baseline comparison covers only the medical domain (N = 1,000); broader cross-domain baseline evaluation is warranted. The shared blindness floor of 5.8% represents an irreducible limitation requiring interventions orthogonal to consensus. Clinical validation within an integrated EHR workflow remains a critical next step. Finally, computational cost scales linearly with ensemble size; future work should evaluate whether σ from smaller ensembles (n = 2 or 3) retains sufficient discriminative power for practical deployment.

Methods

Models and query design

Four frontier models were queried: Claude-Sonnet-3.5 (Anthropic), GPT-4o (OpenAI), Gemini-1.5-Pro (Google), and DeepSeek-V3 (DeepSeek). Queries spanned three domains: medical (30 topics), legal (30 topics), and technical (30 topics, comprising programming, cloud infrastructure, cybersecurity, and data science). Each of 90 base topics generated a chain of 4 turns: T1 (factual, context-answerable), T2 (inferential follow-up), T3 (edge case), T4 (speculative, beyond established knowledge). Additionally, 30 non-existent entity probes (fabricated drugs, legal cases, technical standards) stress-tested hallucination detection. Total: 298 unique topics × 4 models × 3 conditions + probes = 13,728 individual responses.

Experimental conditions

Three conditions varied available information: EBP (anchor document plus epistemic commitment protocol [17]), Anchored (anchor document only), and Unanchored (no anchor, parametric knowledge only). This design enables analysis of how epistemic grounding affects both hallucination rates and error correlation patterns.

Labelling, validation, and confidence elicitation

Responses were labelled as correct, hallucination, or abstention. A sample of 500 responses was independently human-validated, achieving Cohen’s κ = 0.87. Self-reported confidence was elicited via a structured system prompt instructing each model to append a confidence score between 0.0 and 1.0 to every response.
The prompt was identical across all four models. All four models reported confidence = 1.0 on 100% of responses (N = 13,728), including responses subsequently labelled as hallucinations. We did not use token-level log-probabilities, which are unavailable for most commercial APIs and would not be comparable across architectures.

Consensus field computation

The consensus field σ(q) is defined as the average pairwise semantic similarity across all model pairs: σ(q) = [2 / (n(n − 1))] Σ_{i<j} sim(Mᵢ(q), Mⱼ(q)). For n = 4 models, this evaluates 6 pairwise comparisons per query. Semantic similarity was computed using embedding cosine similarity (text-embedding-3-small, OpenAI) with threshold 0.85 for binary agreement. Robustness was confirmed across thresholds 0.75–0.95 (AUC-ROC stable at 0.77–0.80), an alternative embedding model (text-embedding-ada-002; Spearman ρ = 0.91 with primary model), and response length controls (β_σ = −1.82, p < 0.001; β_length = 0.03, p = 0.41).

Noise coefficient

The noise coefficient η for model M is defined as η = P(H(M(q), q) = 1 | σ(q) < τ), where τ = 0.5 is a low-consensus threshold. The hallucination predictor is P(H | q, M) = (1 − σ(q))·η.

Baseline methods

The baseline comparison study (N = 1,000, medical domain) evaluated five methods: random baseline, self-reported confidence, majority voting (4 models), SelfCheck-Jaccard [6] (N = 5 samples per model), and SelfCheck-Embedding (N = 5 samples per model plus embedding). All methods were evaluated on AUC-ROC, AUC-PR, Spearman ρ, and Pearson r using the same response data.

Statistical analysis

Overdispersion was computed as ρ = [Var(H_q) − np(1 − p)] / [np(1 − p)], where ρ > 0 indicates correlated failures. Error correlation was assessed using phi coefficients between all model pairs. Condition effects were tested using McNemar’s test.
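The definitions of σ, the predictor, and the overdispersion coefficient given in this section can be sketched in a few lines of Python. This is one illustrative reading, not the authors' analysis code. It assumes σ is the fraction of model pairs whose embedding cosine similarity reaches the 0.85 threshold, it estimates p by pooling all hallucination counts, and it uses toy two-dimensional vectors in place of real embeddings. All function names are hypothetical.

```python
from itertools import combinations
from math import sqrt


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def consensus_field(embeddings: list[list[float]], threshold: float = 0.85) -> float:
    """Fraction of model pairs that agree (assumed reading of binary agreement)."""
    pairs = list(combinations(embeddings, 2))  # n(n-1)/2 pairs, 6 when n = 4
    agreeing = sum(1 for u, v in pairs if cosine(u, v) >= threshold)
    return agreeing / len(pairs)


def hallucination_risk(sigma: float, eta: float) -> float:
    """Predictor P(H | q, M) = (1 - sigma) * eta."""
    return (1.0 - sigma) * eta


def overdispersion(counts: list[int], n: int) -> float:
    """rho = [Var(H_q) - n p (1 - p)] / [n p (1 - p)], with p pooled over all groups."""
    m = len(counts)
    p = sum(counts) / (m * n)
    mean = sum(counts) / m
    var = sum((h - mean) ** 2 for h in counts) / m  # population variance
    binomial_var = n * p * (1 - p)
    return (var - binomial_var) / binomial_var


# Toy example: three responses point the same way, one disagrees.
toy_embeddings = [[1.0, 0.0], [1.0, 0.1], [0.9, 0.2], [0.0, 1.0]]
sigma = consensus_field(toy_embeddings)
print("sigma =", sigma)
print("risk  =", hallucination_risk(sigma, eta=0.342))  # eta reported for Claude-Sonnet

# Hallucinating-model counts per query group (made-up values for illustration).
print("rho   =", overdispersion([0, 4, 0, 4], n=4))
```

In the toy counts the four models always fail together, so ρ reaches its maximum of n − 1. Independent failures would give ρ near zero.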
Association between σ and hallucination was tested using Pearson and Spearman correlations, χ² test with Cramér’s V, and logistic regression with group cross-validation.

Declarations

Data availability

The experimental prompts, generated outputs, embedding vectors, and statistical analysis code that support the findings of this study are available from the corresponding author upon reasonable request and will be deposited in a public repository upon acceptance.

Code availability

The analysis scripts used to compute consensus metrics, overdispersion coefficients, baseline comparisons, and statistical tests are available from the corresponding author upon reasonable request.

References

1. Sakai Y, Kamigaito H, Watanabe T (2026) Hallucitation matters: revealing the impact of hallucinated references in ACL conferences. Preprint at https://arxiv.org/abs/2601.18724
2. Gallifant J, Kellogg KC, Butler M et al (2025) A field guide to deploying AI agents in clinical practice. Preprint at https://arxiv.org/abs/2509.26153
3. Wang X, Wei J, Schuurmans D et al (2023) Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR 2023
4. Kadavath S, Conerly T, Askell A et al (2022) Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221
5. Xiong M, Hu Z, Lu X et al (2024) Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In Proc. ICLR 2024
6. Manakul P, Liusie A, Gales MJF (2023) SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proc. EMNLP 2023, 9004–9017
7. [Self-citation removed for double-blind review]
8. Farquhar S, Kossen J, Kuhn L, Gal Y (2024) Detecting hallucinations in large language models using semantic entropy. Nature 630:625–630
9. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proc. ICML, 1050–1059
10. Lakshminarayanan B, Pritzel A, Blundell C (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Proc. NeurIPS 30
11. Ji Z, Lee N, Frieske R et al (2023) Survey of hallucination in natural language generation. ACM Comput Surv 55:1–38
12. Lewis P, Perez E, Piktus A et al (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. NeurIPS 33, 9459–9474
13. Lin S, Hilton J, Evans O (2022) TruthfulQA: measuring how models mimic human falsehoods. In Proc. ACL, 3214–3252
14. Dziri N, Rashkin H, Linzen T, Reitter D (2022) On the origin of hallucinations in conversational models: is it the datasets or the models? In Proc. NAACL, 5271–5285
15. Maynez J, Narayan S, Bohnet B, McDonald R (2020) On faithfulness and factuality in abstractive summarization. In Proc. ACL, 1906–1919
16. Feng S, Shi W, Wang Y et al (2024) Don’t hallucinate, abstain: identifying LLM knowledge gaps via multi-LLM collaboration. In Proc. ACL 2024, 14664–14690
17. [Self-citation removed for double-blind review]

Additional Declarations

Competing interests: Yes, there is a potential competing interest. A.H. is the founder and CEO of Eptim.ai, which develops AI verification infrastructure based on multi-model consensus methods. S.T. is affiliated with Eptim.ai. The Epistemic Field Theory described in this paper is related to but distinct from Eptim.ai's commercial products. The experimental work was conducted independently using publicly available model APIs.

Supplementary Files

Supplementary Information: EFTNatCommSupplementary.docx
Author information

AZRIL Hamzah (corresponding author), Eptim.ai. ORCID: https://orcid.org/0009-0009-2760-6073
Shasha Teng, ASCO EDUTECH. ORCID: https://orcid.org/0000-0001-6188-9470

Figure 1 | Consensus field predicts hallucination with monotonic gradient. a, Hallucination rate by consensus band (σ). Rates decrease from 50.1% at σ ∈ [0, 0.2) to 5.8% at σ = 1.0 (OR = 16.3). Dashed line: shared blindness floor. b, Continuous σ–hallucination relationship with 95% CI. Pearson r = −0.36, p < 0.001.

Figure 2 | EFT consensus field outperforms all baseline detection methods. a, AUC-ROC and AUC-PR comparison. EFT σ (red) achieves AUC-ROC = 0.787. b, Representative ROC curves. All methods except σ perform near or below chance.

Figure 3 | Independence assumption is violated; grounding and correlation are independent dimensions. a, Expected vs observed majority-failure rates (2.96× discrepancy, ρ = 1.50). b, Grounding reduces hallucination rates but overdispersion remains constant (ρ ≈ 1.5).

Figure 4 | Decision gating: coverage–reliability tradeoff. Each point represents a consensus threshold σ ≥ θ. Shaded zones: confident automation (<10%), cautious automation (10–20%), human review (>20%).
The same model, given the same query, may produce correct answers, confident fabrications, or hedged speculation, all rendered with comparable surface fluency. This instability creates a fundamental problem for automated systems: distinguishing reliable outputs from hallucinations requires external verification that the system cannot itself provide. Hallucination has been documented across question answering\u003csup\u003e13\u003c/sup\u003e, summarisation\u003csup\u003e15\u003c/sup\u003e, and dialogue\u003csup\u003e14\u003c/sup\u003e, with causes traced to exposure bias, likelihood maximisation objectives, and training data sparsity\u003csup\u003e11\u003c/sup\u003e.\u003c/p\u003e\n\u003cp\u003eThe scale of this problem continues to grow. Sakai et al.\u003csup\u003e1\u003c/sup\u003e found that 3.7% of papers at EMNLP 2025 contained citations to non-existent publications, a 13\u0026times; increase from the previous year. In clinical deployment, Gallifant et al.\u003csup\u003e2\u003c/sup\u003e report that over 80% of effort is consumed by validation infrastructure, drift management, and governance rather than model development, and that validation must function as a perpetual service because monthly API updates continuously alter model behaviour.\u003c/p\u003e\n\u003cp\u003eA substantial body of work has addressed hallucination through uncertainty estimation and detection. Bayesian deep learning approaches use dropout-based approximations\u003csup\u003e9\u003c/sup\u003e or deep ensembles\u003csup\u003e10\u003c/sup\u003e to estimate epistemic uncertainty from token-level probabilities. More recently, Farquhar et al.\u003csup\u003e8\u003c/sup\u003e introduced semantic entropy, which detects hallucinations by measuring meaning-level consistency across sampled outputs from a single model, achieving strong results for open-ended generation tasks. 
SelfCheckGPT\u003csup\u003e6\u003c/sup\u003e takes a related approach, sampling multiple responses from one model and measuring consistency via Jaccard similarity or embedding distance. Model self-reported confidence\u003csup\u003e4,5\u003c/sup\u003e elicits uncertainty through prompting or calibration. However, all these methods operate within a single model and therefore inherit that model\u0026rsquo;s systematic biases, a limitation that becomes critical when errors are correlated across samples.\u003c/p\u003e\n\u003cp\u003eA parallel line of work leverages agreement across multiple models or agents. Self-consistency voting\u003csup\u003e3\u003c/sup\u003e improves chain-of-thought reasoning by sampling multiple reasoning paths and selecting the most common answer via majority voting. Multi-LLM collaboration frameworks\u003csup\u003e16\u003c/sup\u003e identify knowledge gaps through cross-model disagreement. Deep ensembles\u003csup\u003e10\u003c/sup\u003e improve robustness through agreement among independently trained instances. These approaches share a common structural assumption: that hallucination events arise as independent stochastic errors, such that majority voting over \u003cem\u003en\u003c/em\u003e samples reduces failure probability according to binomial expectations. Despite widespread deployment, this independence assumption has not been empirically validated at scale across diverse models and domains.\u003c/p\u003e\n\u003cp\u003eThe existing landscape thus operates at two levels that leave a critical gap. Model-level interventions (retrieval augmentation\u003csup\u003e12\u003c/sup\u003e, constrained decoding, fine-tuning) aim to reduce hallucination frequency. Output-level detection (self-consistency checking, fact verification, uncertainty estimation) aims to flag problematic responses post-generation. 
Neither provides a principled framework for predicting hallucination risk from observable signals before downstream decisions are made, and neither addresses what happens when the errors that detection methods rely on are themselves correlated.\u003c/p\u003e\n\u003cp\u003eWe propose Epistemic Field Theory (EFT), which addresses this gap by formalising multi-model consensus as a predictive signal for hallucination. The core insight is that when independent models agree on a response, that response is more likely to be correct, and the degree of semantic agreement across the full response space carries richer information than binary answer-level voting. EFT formalises this through three components: a consensus field \u0026sigma; : Q \u0026rarr; [0, 1] measuring pairwise semantic similarity across an ensemble; a noise coefficient \u0026eta; \u0026isin; [0, 1] capturing each model\u0026rsquo;s intrinsic hallucination rate in low-consensus regions; and a hallucination predictor P(H | q, M) = (1 \u0026minus; \u0026sigma;(q))\u0026middot;\u0026eta;. Where semantic entropy\u003csup\u003e8\u003c/sup\u003e measures consistency within a single model\u0026rsquo;s output distribution, \u0026sigma; measures consistency across models with different training backgrounds, accessing query-level epistemic properties that no single model can observe. Formally, majority voting is a special case of the consensus field in which the similarity function is reduced to exact string match and the continuous field value is collapsed to a binary threshold (see Supplementary Note 1 for formal definitions and theoretical analysis).\u003c/p\u003e\n\u003cp\u003eHere we validate EFT across 13,728 human-validated responses from four frontier models in medical, legal, and technical domains. 
We report four principal findings: (1) the consensus field predicts hallucination with AUC = 0.787, substantially outperforming all baselines including semantic entropy\u0026rsquo;s single-model approach (Fig. 2); (2) hallucination errors exhibit systematic overdispersion, with majority-failure rates 2.96\u0026times; higher than independence predicts, directly challenging the foundation of self-consistency methods (Fig. 3a); (3) epistemic grounding reduces hallucination rates but not error correlation, revealing two independent dimensions of the hallucination problem (Fig. 3b); and (4) model self-reported confidence is entirely uninformative, with all four frontier models reporting maximum confidence on 100% of responses including hallucinations.\u003c/p\u003e"},{"header":"Results","content":"\u003cp\u003e\u003cstrong\u003eDataset and validation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eWe collected 13,728 individual model responses across 298 unique topics, organised into 3,432 query groups (each comprising one query answered by all four models). Domain breakdown: Medical (4,672 responses), Legal (4,640), Technical (4,416). Overall: 67.1% correct, 29.4% hallucination, 2.7% abstention. A sample of 500 responses was independently human-validated, achieving Cohen\u0026rsquo;s \u0026kappa; = 0.87 (see Methods for full experimental design).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eConsensus field predicts hallucination\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe EFT prediction\u0026mdash;\u0026sigma; negatively correlates with hallucination\u0026mdash;is confirmed: Pearson \u003cem\u003er\u003c/em\u003e = \u0026minus;0.36, \u003cem\u003ep\u003c/em\u003e \u0026lt; 0.001; Spearman \u0026rho; = \u0026minus;0.36, \u003cem\u003ep\u003c/em\u003e \u0026lt; 0.001 (\u0026chi;\u0026sup2; = 1762.5, df = 4, Cram\u0026eacute;r\u0026rsquo;s \u003cem\u003eV\u003c/em\u003e = 0.36). Figure 1a shows hallucination rates stratified by consensus band. 
The monotonic decrease from 50.1% at low \u0026sigma; to 5.8% at \u0026sigma; = 1.0 (odds ratio 16.3) confirms the core EFT prediction. The continuous relationship between \u0026sigma; and hallucination rate (Fig. 1b) demonstrates that the consensus field provides a graded risk signal rather than a binary classification. The residual 5.8% at \u0026sigma; = 1.0 defines the \u003cstrong\u003eshared blindness floor\u003c/strong\u003e: the irreducible failure rate when all models share the same misconception due to correlated training data (see Supplementary Note 5 for worked examples).\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 1 | Hallucination rate by consensus band. N = 13,728 responses across 3,432 query groups.\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"507\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026sigma; Range\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21.1045%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eHalluc. 
Rate\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 31.5582%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eResponses (Groups)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eOR vs \u0026sigma;=1.0\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e[0.0, 0.2)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21.1045%;\"\u003e\n \u003cp\u003e50.1%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 31.5582%;\"\u003e\n \u003cp\u003e3,728 (932)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e16.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e[0.2, 0.4)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21.1045%;\"\u003e\n \u003cp\u003e38.9%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 31.5582%;\"\u003e\n \u003cp\u003e2,172 (543)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e10.3\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e[0.4, 0.6)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21.1045%;\"\u003e\n \u003cp\u003e27.1%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 31.5582%;\"\u003e\n \u003cp\u003e2,580 (645)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e6.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e[0.6, 0.8)\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21.1045%;\"\u003e\n \u003cp\u003e24.4%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 31.5582%;\"\u003e\n \u003cp\u003e1,196 (299)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e5.2\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e[0.8, 1.0)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21.1045%;\"\u003e\n \u003cp\u003e19.9%\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 31.5582%;\"\u003e\n \u003cp\u003e692 (173)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e4.0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026sigma; = 1.0\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 21.1045%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e5.8%\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 31.5582%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e3,360 (840)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 23.6686%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e1.0 (ref)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eModel-specific noise coefficients\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNoise coefficients \u0026eta; vary across models: Claude-Sonnet (0.342), DeepSeek (0.399), Gemini-Pro (0.533), GPT-4o (0.565). 
Domain-specific coefficients reveal consistent model ordering across medical, legal, and technical domains (Supplementary Table 1), suggesting \u0026eta; captures a stable model-level property rather than a domain-specific artifact.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eError correlation falsifies independence\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eJoint hallucination rates between model pairs are 2.0\u0026ndash;2.7\u0026times; higher than expected under independence (Table 2). Phi coefficients range from 0.49 to 0.55, indicating substantial positive error correlation across all six model pairs. Formally, for each query group we count hallucinating models \u003cem\u003eH\u003c/em\u003e\u003csub\u003eq\u003c/sub\u003e out of \u003cem\u003en\u003c/em\u003e = 4. Under independence, \u003cem\u003eH\u003c/em\u003e\u003csub\u003eq\u003c/sub\u003e ~ Binomial(\u003cem\u003en\u003c/em\u003e, \u003cem\u003ep\u003c/em\u003e). The overdispersion coefficient \u0026rho; = 1.50 substantially exceeds zero, confirming statistical dependence (Fig. 3a). The empirical majority-failure rate is 20.2%, versus 6.8% predicted under independence\u0026mdash;a 2.96\u0026times; discrepancy. 
Self-consistency methods systematically overestimate their reliability gains by this factor.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 2 | Error correlation between model pairs.\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"480\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 30.5613%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eModel pair\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e\u0026phi;\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19.3347%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eP(both H)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 22.2453%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eP(H₁)\u0026middot;P(H₂)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eRatio\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 30.5613%;\"\u003e\n \u003cp\u003eClaude \u0026times; GPT-4o\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e0.497\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19.3347%;\"\u003e\n \u003cp\u003e0.180\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 22.2453%;\"\u003e\n \u003cp\u003e0.081\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e2.24\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 30.5613%;\"\u003e\n \u003cp\u003eClaude \u0026times; Gemini\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e0.502\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19.3347%;\"\u003e\n \u003cp\u003e0.174\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 22.2453%;\"\u003e\n \u003cp\u003e0.075\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e2.32\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 30.5613%;\"\u003e\n \u003cp\u003eClaude \u0026times; DeepSeek\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e0.549\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19.3347%;\"\u003e\n \u003cp\u003e0.159\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 22.2453%;\"\u003e\n \u003cp\u003e0.058\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e2.72\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 30.5613%;\"\u003e\n \u003cp\u003eGPT-4o \u0026times; Gemini\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e0.545\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19.3347%;\"\u003e\n \u003cp\u003e0.243\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 22.2453%;\"\u003e\n \u003cp\u003e0.120\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e2.03\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 30.5613%;\"\u003e\n \u003cp\u003eGPT-4o \u0026times; DeepSeek\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e0.493\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19.3347%;\"\u003e\n \u003cp\u003e0.197\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"top\" style=\"width: 22.2453%;\"\u003e\n \u003cp\u003e0.093\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e2.11\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 30.5613%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eGemini \u0026times; DeepSeek\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.522\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 19.3347%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.195\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 22.2453%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.087\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 13.9293%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e2.24\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eComparison with existing detection methods\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eFigure 2 compares \u0026sigma; against five baselines on a medical-domain subsample (N = 1,000). Three findings merit emphasis. First, self-reported model confidence is entirely uninformative: all four models report maximum confidence (1.0) on 100% of responses, including hallucinations (AUC = 0.461, below chance). Second, SelfCheck methods (Jaccard AUC = 0.358, Embedding AUC = 0.377) perform below random despite requiring substantially more compute, consistent with the correlated-error finding: SelfCheck operates within a single model, sampling from the same biased distribution. 
Third, the comparison between \u0026sigma; and majority voting is particularly informative because both use identical model responses; the AUC gap (0.787 vs 0.518) arises entirely from the information extraction method. The near-zero correlation between \u0026sigma; and SelfCheck scores (\u003cem\u003er\u003c/em\u003e = 0.02\u0026ndash;0.14) confirms these methods capture orthogonal information dimensions.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTable 3 | Hallucination detection method comparison (N = 1,000, medical domain).\u003c/strong\u003e\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"467\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 40.0428%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eMethod\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAUC-ROC\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAUC-PR\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.6959%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eAPI calls\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 40.0428%;\"\u003e\n \u003cp\u003eRandom\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.528\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.227\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.6959%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 40.0428%;\"\u003e\n \u003cp\u003eSelf-reported confidence\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n 
\u003cp\u003e0.461\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.166\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.6959%;\"\u003e\n \u003cp\u003e0\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 40.0428%;\"\u003e\n \u003cp\u003eMajority Vote (4 models)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.518\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.235\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.6959%;\"\u003e\n \u003cp\u003e4\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 40.0428%;\"\u003e\n \u003cp\u003eSelfCheck-Jaccard (N=5)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.358\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.143\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.6959%;\"\u003e\n \u003cp\u003e5/model\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 40.0428%;\"\u003e\n \u003cp\u003eSelfCheck-Embedding (N=5)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.377\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e0.163\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.6959%;\"\u003e\n \u003cp\u003e5/model + embed\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 40.0428%;\"\u003e\n \u003cp\u003e\u003cstrong\u003eEFT \u0026sigma; (4 models)\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.787\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 17.1306%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e0.417\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 25.6959%;\"\u003e\n \u003cp\u003e\u003cstrong\u003e4\u003c/strong\u003e\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e\n\u003cp\u003e\u003cstrong\u003eGrounding reduces rates but not correlation\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe hallucination base rate decreases monotonically with grounding (Unanchored 30.1% \u0026rarr; Anchored 27.1% \u0026rarr; EBP 26.3%; McNemar \u0026chi;\u0026sup2; = 55.9, \u003cem\u003ep\u003c/em\u003e \u0026lt; 0.001). However, overdispersion remains at approximately \u0026rho; \u0026asymp; 1.5 across all conditions (\u0026phi; = 0.505\u0026ndash;0.529; Fig. 3b). This dissociation reveals two independent dimensions of the hallucination problem: frequency (how often models hallucinate) and structure (the pattern of co-failure). Grounding interventions address frequency; \u0026sigma;-based gating addresses structure. Neither alone is sufficient.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eDecision gating analysis\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eConsensus-gated commitment enables threshold-based automation with quantifiable coverage\u0026ndash;reliability trade-offs (Fig. 4). At \u0026sigma; \u0026ge; 0.6, hallucination rate drops to 11.9% (59.5% reduction) while retaining 38.2% coverage. At \u0026sigma; = 1.0, the rate reaches 5.8% (the shared blindness floor) with 24.5% coverage. 
This enables tiered deployment: high-\u0026sigma; responses proceed to confident automation; intermediate-\u0026sigma; responses enter cautious automation with monitoring; low-\u0026sigma; responses are routed to human review, reducing review burden by up to 72.8%.\u003c/p\u003e"},{"header":"Discussion","content":"\u003cp\u003eThese results establish hallucination as a structured, predictable phenomenon conditioned on epistemic context rather than independent sampling noise.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eTheoretical contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis work makes three theoretical contributions to the understanding of LLM reliability. First, we provide the first large-scale empirical falsification of the error independence assumption underlying self-consistency methods\u003csup\u003e3\u003c/sup\u003e. The finding that multi-model errors are systematically correlated (\u0026rho; = 1.50, \u0026phi; = 0.49\u0026ndash;0.55) directly challenges the theoretical foundation of approaches that assume superlinear reliability gains from repeated sampling. Any reliability analysis of self-consistency or majority voting that assumes independence underestimates the majority-failure rate by approximately 3\u0026times;, a quantitative correction factor that should inform future theoretical work on ensemble reliability.\u003c/p\u003e\n\u003cp\u003eSecond, we establish that hallucination has two independent dimensions, frequency and correlation structure, that respond to different interventions. Grounding reduces hallucination rates (30.1% \u0026rarr; 26.3%) but leaves the overdispersion coefficient unchanged (\u0026rho; \u0026asymp; 1.5 across all conditions). 
This dissociation has not been previously demonstrated and suggests that the correlation structure arises from shared training data rather than from the query\u0026rsquo;s epistemic properties per se.\u003c/p\u003e\n\u003cp\u003eThird, we formalise the consensus field \u0026sigma; as a continuous generalisation of majority voting that preserves semantic richness. The information-theoretic gap between \u0026sigma; (AUC = 0.787) and majority voting (AUC = 0.518) on identical model responses demonstrates that the shift from binary answer-level agreement to continuous response-level semantic similarity captures substantially more predictive information about hallucination risk, at zero additional computational cost.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003ePractical contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eEFT provides several directly deployable capabilities for AI systems operating in high-stakes environments.\u003c/p\u003e\n\u003cp\u003eThe consensus field \u0026sigma; serves as a real-time reliability signal that requires no model retraining and no access to internal model states, and that operates with standard commercial API calls. This makes it immediately applicable in production settings where model internals are unavailable, which is the predominant deployment scenario for clinical, legal, and enterprise applications.\u003c/p\u003e\n\u003cp\u003eThe decision gating framework (Fig. 4) translates \u0026sigma; into actionable deployment policies. At \u0026sigma; \u0026ge; 0.8, hallucination rates drop to 8.2% with 29.5% coverage, enabling confident automation for nearly a third of queries. At \u0026sigma; \u0026lt; 0.2, the 50.1% hallucination rate justifies mandatory human review. The continuous nature of \u0026sigma; allows practitioners to set domain-appropriate thresholds: a clinical application might require \u0026sigma; \u0026ge; 0.8 for autonomous operation, while a customer service application might accept \u0026sigma; \u0026ge; 0.4. 
This threshold-based approach directly addresses the perpetual validation requirement described by Gallifant et al.\u003csup\u003e2\u003c/sup\u003e: \u0026sigma; provides a continuous monitoring signal that automatically detects when a model update causes disagreement with the ensemble, enabling drift detection without manual review cycles.\u003c/p\u003e\n\u003cp\u003eThe noise coefficient \u0026eta; provides a model-level reliability metric that is stable across domains (Supplementary Table 1). If validated at scale, \u0026eta; could serve as a standardised reliability coefficient for model procurement and regulatory compliance, analogous to established safety metrics in other engineering domains.\u003c/p\u003e\n\u003cp\u003eThe shared blindness floor of 5.8% at \u0026sigma; = 1.0 quantifies the irreducible limitation of consensus-based methods, providing practitioners with a precise scope boundary. This measurement enables informed architectural decisions: systems requiring error rates below 5.8% must incorporate retrieval augmentation, authoritative knowledge bases, or human expert review in addition to consensus-based gating.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eRelationship to existing approaches\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe consensus field \u0026sigma; outperforms existing detection methods because it measures across models, accessing query-level epistemic properties that no single model can observe. Self-reported confidence is uninformative because current frontier models lack calibrated epistemic self-assessment when confidence is requested through prompting\u003csup\u003e4,5\u003c/sup\u003e. SelfCheck methods\u003csup\u003e6\u003c/sup\u003e sample from within a single model and inherit its systematic biases. 
Semantic entropy\u003csup\u003e8\u003c/sup\u003e detects hallucinations through meaning-level consistency within a single model; \u0026sigma; extends this principle to the cross-model setting, where the diversity of training backgrounds provides a stronger signal. The comparison with majority voting is especially informative: both methods consume identical model responses, yet \u0026sigma; achieves AUC = 0.787 versus 0.518, with the gap arising entirely from the shift from binary answer-level agreement to continuous semantic similarity across full responses.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eLimitations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eSeveral limitations warrant emphasis. We tested three professional domains; generalisation to creative, mathematical, or multilingual tasks is unvalidated. The informal theoretical bound (Supplementary Note 2) is looser than the idealised case given observed error correlations (\u0026phi; = 0.49\u0026ndash;0.55); deriving tighter bounds under explicit models of partial dependence is an important direction for future theoretical work. Noise coefficients will shift as models are updated, requiring periodic recalibration. The baseline comparison covers only the medical domain (N = 1,000); broader cross-domain baseline evaluation is warranted. The shared blindness floor of 5.8% represents an irreducible limitation requiring interventions orthogonal to consensus. Clinical validation within an integrated EHR workflow remains a critical next step. 
Finally, computational cost scales linearly with ensemble size; future work should evaluate whether σ from smaller ensembles (n = 2 or 3) retains sufficient discriminative power for practical deployment.

Methods

Models and query design

Four frontier models were queried: Claude-Sonnet-3.5 (Anthropic), GPT-4o (OpenAI), Gemini-1.5-Pro (Google), and DeepSeek-V3 (DeepSeek). Queries spanned three domains: medical (30 topics), legal (30 topics), and technical (30 topics, comprising programming, cloud infrastructure, cybersecurity, and data science). Each of 90 base topics generated a chain of 4 turns: T1 (factual, context-answerable), T2 (inferential follow-up), T3 (edge case), T4 (speculative, beyond established knowledge). Additionally, 30 non-existent entity probes (fabricated drugs, legal cases, technical standards) stress-tested hallucination detection. Total: 298 unique topics × 4 models × 3 conditions + probes = 13,728 individual responses.

Experimental conditions

Three conditions varied available information: EBP (anchor document plus epistemic commitment protocol [17]), Anchored (anchor document only), and Unanchored (no anchor, parametric knowledge only). This design enables analysis of how epistemic grounding affects both hallucination rates and error correlation patterns.

Labelling, validation, and confidence elicitation

Responses were labelled as correct, hallucination, or abstention. A sample of 500 responses was independently human-validated, achieving Cohen's κ = 0.87.
Self-reported confidence was elicited via a structured system prompt instructing each model to append a confidence score between 0.0 and 1.0 to every response. The prompt was identical across all four models. All four models reported confidence = 1.0 on 100% of responses (N = 13,728), including responses subsequently labelled as hallucinations. We did not use token-level log-probabilities, which are unavailable for most commercial APIs and would not be comparable across architectures.

Consensus field computation

The consensus field σ(q) is defined as the average pairwise semantic similarity across all model pairs: σ(q) = [2 / (n(n − 1))] Σ sim(Mᵢ(q), Mⱼ(q)), summed over all pairs i < j. For n = 4 models, this evaluates 6 pairwise comparisons per query. Semantic similarity was computed using embedding cosine similarity (text-embedding-3-small, OpenAI) with threshold 0.85 for binary agreement. Robustness was confirmed across thresholds 0.75–0.95 (AUC-ROC stable at 0.77–0.80), an alternative embedding model (text-embedding-ada-002; Spearman ρ = 0.91 with primary model), and response length controls (β_σ = −1.82, p < 0.001; β_length = 0.03, p = 0.41).

Noise coefficient

The noise coefficient η for model M is defined as η = P(H(M(q), q) = 1 | σ(q) < τ), where τ = 0.5 is a low-consensus threshold.
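The two definitions above can be illustrated with a minimal sketch. It is not the authors' analysis code: the function names (`cosine`, `consensus_field`, `noise_coefficient`) are hypothetical, the embeddings are assumed to be precomputed vectors (one per model response), and the optional `threshold` argument reflects one possible reading of "threshold 0.85 for binary agreement", in which each pairwise similarity is replaced by an agree/disagree indicator before averaging.

```python
from itertools import combinations
from math import sqrt
from typing import Optional, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    if norm == 0.0:
        raise ValueError("zero-length embedding")
    return dot / norm


def consensus_field(
    embeddings: Sequence[Sequence[float]],
    threshold: Optional[float] = None,
) -> float:
    """Average pairwise similarity over all model pairs i < j.

    embeddings: one vector per model response to the same query.
    threshold: if given, each similarity is binarised (assumed reading
    of the paper's "threshold 0.85 for binary agreement").
    """
    if len(embeddings) < 2:
        raise ValueError("need responses from at least two models")
    sims = [cosine(u, v) for u, v in combinations(embeddings, 2)]
    if threshold is not None:
        sims = [1.0 if s >= threshold else 0.0 for s in sims]
    # len(sims) equals n(n - 1) / 2, so this is the [2 / (n(n - 1))] * sum form.
    return sum(sims) / len(sims)


def noise_coefficient(
    sigmas: Sequence[float],
    hallucinated: Sequence[int],
    tau: float = 0.5,
) -> float:
    """Empirical P(H = 1 | sigma < tau) for a single model.

    sigmas: consensus value per query; hallucinated: 0/1 label per query.
    """
    flags = [h for s, h in zip(sigmas, hallucinated) if s < tau]
    if not flags:
        raise ValueError("no queries fall below the low-consensus threshold")
    return sum(flags) / len(flags)
```

For four models, `combinations` yields the 6 pairs mentioned in the text. Raw cosine similarity can be negative for some embedding pairs, so a deployment that needs σ strictly in [0, 1] would have to clip or rescale; that choice is left open here.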
The hallucination predictor is P(H | q, M) = (1 − σ(q))·η.

Baseline methods

The baseline comparison study (N = 1,000, medical domain) evaluated five methods: random baseline, self-reported confidence, majority voting (4 models), SelfCheck-Jaccard [6] (N = 5 samples per model), and SelfCheck-Embedding (N = 5 samples per model plus embedding). All methods were evaluated on AUC-ROC, AUC-PR, Spearman ρ, and Pearson r using the same response data.

Statistical analysis

Overdispersion was computed as ρ = [Var(H_q) − np(1 − p)] / [np(1 − p)], where ρ > 0 indicates correlated failures. Error correlation was assessed using phi coefficients between all model pairs. Condition effects were tested using McNemar's test.
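As a worked illustration of the predictor and the overdispersion statistic given above, the following sketch uses hypothetical function names and makes two assumptions that the text does not state: H_q is taken to be the number of models (out of n) that hallucinate on query q, and Var(H_q) is computed as the population variance across queries.

```python
from statistics import mean, pvariance
from typing import Sequence


def hallucination_probability(sigma: float, eta: float) -> float:
    """Predictor P(H | q, M) = (1 - sigma(q)) * eta."""
    if not (0.0 <= sigma <= 1.0 and 0.0 <= eta <= 1.0):
        raise ValueError("sigma and eta are expected to lie in [0, 1]")
    return (1.0 - sigma) * eta


def overdispersion(counts: Sequence[int], n_models: int) -> float:
    """rho = [Var(H_q) - n p (1 - p)] / [n p (1 - p)].

    counts: per-query number of hallucinating models (assumed meaning of H_q).
    p: pooled per-response hallucination rate, estimated as mean(counts) / n.
    Population variance is an assumption; the paper does not say which
    estimator was used.
    """
    p = mean(counts) / n_models
    binomial_var = n_models * p * (1.0 - p)
    if binomial_var == 0.0:
        raise ValueError("binomial variance is zero; rho is undefined")
    return (pvariance(counts) - binomial_var) / binomial_var
```

With full consensus (σ = 1) the predictor returns 0 regardless of η, which is why the shared blindness floor reported at σ = 1.0 lies outside what this formula can capture. In the overdispersion sketch, counts that cluster at 0 and n (all models fail together or none do) give ρ > 0, while counts that never vary give ρ = −1.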
Association between σ and hallucination was tested using Pearson and Spearman correlations, a χ² test with Cramér's V, and logistic regression with group cross-validation.

Declarations

Data availability

The experimental prompts, generated outputs, embedding vectors, and statistical analysis code that support the findings of this study are available from the corresponding author upon reasonable request and will be deposited in a public repository upon acceptance.

Code availability

The analysis scripts used to compute consensus metrics, overdispersion coefficients, baseline comparisons, and statistical tests are available from the corresponding author upon reasonable request.

References

1. Sakai Y, Kamigaito H, Watanabe T (2026) Hallucitation matters: revealing the impact of hallucinated references in ACL conferences. Preprint at https://arxiv.org/abs/2601.18724
2. Gallifant J, Kellogg KC, Butler M et al (2025) A field guide to deploying AI agents in clinical practice.
Preprint at https://arxiv.org/abs/2509.26153
3. Wang X, Wei J, Schuurmans D et al (2023) Self-consistency improves chain of thought reasoning in language models. In Proc. ICLR 2023
4. Kadavath S, Conerly T, Askell A et al (2022) Language models (mostly) know what they know. Preprint at https://arxiv.org/abs/2207.05221
5. Xiong M, Hu Z, Lu X et al (2024) Can LLMs express their uncertainty? An empirical evaluation of confidence elicitation in LLMs. In Proc. ICLR 2024
6. Manakul P, Liusie A, Gales MJF (2023) SelfCheckGPT: zero-resource black-box hallucination detection for generative large language models. In Proc. EMNLP 2023, 9004–9017
7. [Self-citation removed for double-blind review]
8. Farquhar S, Kossen J, Kuhn L, Gal Y (2024) Detecting hallucinations in large language models using semantic entropy. Nature 630:625–630
9. Gal Y, Ghahramani Z (2016) Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proc.
ICML, 1050–1059
10. Lakshminarayanan B, Pritzel A, Blundell C (2017) Simple and scalable predictive uncertainty estimation using deep ensembles. In Proc. NeurIPS 30
11. Ji Z, Lee N, Frieske R et al (2023) Survey of hallucination in natural language generation. ACM Comput Surv 55:1–38
12. Lewis P, Perez E, Piktus A et al (2020) Retrieval-augmented generation for knowledge-intensive NLP tasks. In Proc. NeurIPS 33, 9459–9474
13. Lin S, Hilton J, Evans O (2022) TruthfulQA: measuring how models mimic human falsehoods. In Proc. ACL, 3214–3252
14. Dziri N, Rashkin H, Linzen T, Reitter D (2022) On the origin of hallucinations in conversational models: is it the datasets or the models? In Proc. NAACL, 5271–5285
15. Maynez J, Narayan S, Bohnet B, McDonald R (2020) On faithfulness and factuality in abstractive summarization. In Proc. ACL, 1906–1919
16. Feng S, Shi W, Wang Y et al (2024) Don't hallucinate, abstain: identifying LLM knowledge gaps via multi-LLM collaboration. In Proc.
ACL 2024, 14664–14690
17. [Self-citation removed for double-blind review]

Version 1 posted February 27th, 2026. Status: under review. Journal: Nature Communications (Nature Portfolio). DOI: https://doi.org/10.21203/rs.3.rs-8929856/v1
Subject areas: Ethics; Computational science; Statistics; Diagnosis


License: CC-BY-4.0