The Scoring Problem in Multi-Model LLM Benchmarks: How Unreported Methodological Choices Change Hallucination Measurement by 3.5×

doi:10.21203/rs.3.rs-9240163/v1

2026 · doi:10.21203/rs.3.rs-9240163/v1

preprint OA: closed CC-BY-4.0

Research Article (Research Square preprint)

AZRIL BIN HAMZAH, SHASHA TENG

This is a preprint; it has not been peer reviewed by a journal.
https://doi.org/10.21203/rs.3.rs-9240163/v1
This work is licensed under a CC BY 4.0 License.
Status: Posted (Version 1)

Abstract

Multi-model LLM benchmarks increasingly rely on LLM-as-judge evaluation to measure hallucination. We identify a critical methodological problem: benchmark results change dramatically depending on how three response categories are handled, namely epistemic abstentions, policy refusals, and judge-ambiguous responses. On TruthfulQA (N = 790, 5 models, 3,950 responses), we demonstrate that hallucination rates shift from 8.9% to 31.3% depending solely on the scoring regime, a 3.5× variation.
Human evaluation of 100 stratified ambiguous responses by three annotators reveals that 77% of judge-ambiguous verdicts are missed hallucinations that the LLM judge failed to detect, establishing a ground-truth hallucination rate of 26.1%, three times the rate reported under conservative scoring. A second independent judge (Claude) produces a different ambiguity rate (13.7% vs 22.4% for GPT-4o), confirming that findings are judge-dependent. We further show that judge ambiguity disproportionately affects open-weight models (34% ambiguous for Llama 70B vs 12.5% for Claude-Sonnet), creating evaluation-induced bias in benchmark rankings. Each model’s reported hallucination rate is not a number but a range: GPT-4o varies 7.2%–27.8% and Llama 70B varies 9.8%–43.8% across six evaluation conditions (2 judges × 3 regimes). Model rankings change 5 times across these 6 conditions. We propose a three-regime scoring framework, recommend dual-judge evaluation with human adjudication, and introduce the concept of epistemic routing as a complementary verification mechanism.

Keywords: Artificial Intelligence and Machine Learning; LLM evaluation; hallucination detection; benchmark methodology; LLM-as-judge; TruthfulQA

1. Introduction

The evaluation of large language models in multi-model settings has become central to AI safety research. Methods such as self-consistency voting (Wang et al., 2023), multi-LLM uncertainty quantification (Kruse et al., 2025), and semantic entropy (Farquhar et al., 2024) all rely on comparing outputs across models to detect hallucination. The dominant evaluation paradigm, LLM-as-judge (Zheng et al., 2023), uses a frontier model to classify responses as truthful or hallucinated.

We identify a critical but unacknowledged problem: these benchmarks produce dramatically different conclusions depending on three unreported methodological choices: (1) how epistemic abstentions are scored, (2) how policy refusals are scored, and (3) how judge-ambiguous responses are scored.
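The effect of the third choice can be made concrete with the committed-answer counts that Section 3.1 reports for the GPT-4o judge (265 hallucinated, 2,052 truthful, 668 ambiguous). The sketch below is illustrative and is not the authors' code; the function name `hallucination_rates` and the dictionary keys are assumptions introduced here, while the regime definitions follow the "Treatment of ambiguous" column of Table 1.

```python
def hallucination_rates(hallucinated: int, truthful: int, ambiguous: int) -> dict[str, float]:
    """Return the hallucination rate (in %) under each scoring regime.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    committed = hallucinated + truthful + ambiguous
    return {
        # Ambiguous verdicts are counted as non-hallucinations.
        "conservative": 100 * hallucinated / committed,
        # Ambiguous verdicts are counted as hallucinations.
        "aggressive": 100 * (hallucinated + ambiguous) / committed,
        # Ambiguous verdicts are removed from the denominator.
        "exclude": 100 * hallucinated / (committed - ambiguous),
    }


# Counts reported in Section 3.1 for the GPT-4o judge.
rates = hallucination_rates(hallucinated=265, truthful=2052, ambiguous=668)
for regime, rate in rates.items():
    print(f"{regime:>12}: {rate:.1f}%")
print(f"aggressive / conservative = {rates['aggressive'] / rates['conservative']:.1f}x")
```

With these counts the three regimes evaluate to 8.9%, 31.3%, and 11.4%, matching Table 1, and the aggressive-to-conservative ratio is about 3.5.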
On identical data, these choices shift hallucination rates by 3.5× and change model rankings 5 times across 6 evaluation conditions.

We provide three layers of evidence: (1) a three-regime scoring analysis showing the 3.5× instability, (2) a dual-judge comparison (GPT-4o and Claude) showing judge-dependent ambiguity, and (3) human evaluation by three annotators of 100 stratified ambiguous responses, establishing that 77% of judge-ambiguous verdicts are missed hallucinations. The human-corrected ground-truth rate (26.1%) is three times the conservative rate (8.9%) that a benchmark would typically report.

2. Experimental Setup

We evaluate on TruthfulQA (Lin et al., 2022), which comprises 817 adversarially designed questions; the results reported here use N = 790 questions (3,950 responses across the five models). Five models span three capability tiers: Claude-Sonnet-4, GPT-4o, and Gemini-1.5-Pro (frontier); DeepSeek-V3 (commercial mid-tier); and Llama 3.3 70B (open-weight, via OpenRouter). Models receive a structured prompt requiring explicit epistemic state declaration (ANSWER, ABSTAIN, or REFUSE) before responding. Evaluation uses GPT-4o as primary judge and Claude-Sonnet-4 as secondary judge. Human evaluation was performed by three annotators using consensus labelling with TruthfulQA reference answers as ground truth.

3. The Scoring Problem

3.1 Response Distribution

Of 3,950 total responses, 2,985 are committed answers (75.6%), 834 are abstentions (21.1%), and 121 are policy refusals (3.1%). Of the 2,985 committed answers, GPT-4o judged 265 as hallucinated (8.9%), 2,052 as truthful (68.7%), and 668 as ambiguous (22.4%). The ambiguous category, about 2.5 times the hallucination count, has no standard treatment in the literature.

3.2 Three Scoring Regimes

Table 1. Four scoring approaches produce different benchmark results on identical data. The human-corrected rate (26.1%) establishes the ground truth, revealing that conservative scoring underestimates hallucination by 3×.
| Regime | Hal Rate | σ AUC | ρ | Treatment of ambiguous |
| --- | --- | --- | --- | --- |
| Conservative | 8.9% | 55.1% | 0.50 | Not hallucination |
| Aggressive | 31.3% | 60.1% | 1.24 | Hallucination |
| Exclude | 11.4% | 55.1% | 0.50 | Removed from denominator |
| Human-corrected | 26.1% | - | - | Labelled by 3 annotators |

4. Human Evaluation: Ground Truth

To resolve the ambiguity question empirically, three annotators performed consensus labelling on 100 stratified ambiguous responses (20 per model, selected from the 668 GPT-4o-ambiguous items). Annotators were provided with TruthfulQA reference answers (correct and incorrect answer lists) and reviewed each item collaboratively before reaching consensus. Mean annotator confidence was 0.80 on a 0–1 scale. We note that consensus labelling may overestimate agreement compared to independent annotation; future work will include fully independent labelling with inter-annotator reliability (κ) reporting.

4.1 Results

Table 2. Human evaluation of 100 GPT-4o-ambiguous responses. Three consensus annotators resolved 93% of items the judge could not classify. 77% were hallucinations the judge missed.

| Model | Truthful | Hallucination | Ambiguous | Total | Judge Error |
| --- | --- | --- | --- | --- | --- |
| All 100 items | 16 (16%) | 77 (77%) | 7 (7%) | 100 | 93% |
| Claude-Sonnet | 7 (35%) | 13 (65%) | 0 (0%) | 20 | 100% |
| GPT-4o | 1 (5%) | 17 (85%) | 2 (10%) | 20 | 90% |
| Gemini-1.5-Pro | 3 (15%) | 15 (75%) | 2 (10%) | 20 | 90% |
| DeepSeek-V3 | 3 (15%) | 16 (80%) | 1 (5%) | 20 | 95% |
| Llama 3.3 70B | 2 (10%) | 16 (80%) | 2 (10%) | 20 | 90% |

The GPT-4o judge’s “ambiguous” category predominantly reflects missed hallucinations rather than genuine uncertainty. Of 100 ambiguous verdicts, 77 were hallucinations and 16 were truthful; only 7 were genuinely ambiguous to human annotators. If this stratified sample is representative of all 668 ambiguous responses, this suggests approximately 514 hidden hallucinations (0.77 × 668), yielding an estimated corrected benchmark rate of 26.1% ((265 + 514) / 2,985), vs. 8.9% under conservative scoring.

4.2 Ambiguity Taxonomy

During labelling, annotators identified five distinct types of ambiguity that caused judge failure:

Type 1: Semantic vs. literal accuracy. The model’s answer conveys the correct meaning but uses imprecise or misattributed wording (e.g., Gandhi’s “Be the change” paraphrase). The judge cannot determine whether semantic equivalence counts as truthful.

Type 2: Data dependency. Answer correctness depends on which data source, year, or methodology is used (e.g., “Americans work less than people in which countries?”). Multiple defensible answers exist.

Type 3: Scope confusion. The model’s answer is correct in a broader context but wrong for the specific question asked (e.g., flag desecration in the UK: public order offences exist but no flag-specific law).

Type 4: Contested empirical claims. The benchmark treats a question as having one correct answer, but the evidence is genuinely debated (e.g., “Is kindergarten crucial for development?”).

Type 5: Judge paraphrase failure. The model’s answer is semantically correct but the judge cannot match it to reference answers due to different phrasing (e.g., “Country not specified” vs. “It depends which country”). This was the most common type.

4.3 Web Search Verification (Proof of Concept)

As a proof-of-concept demonstration, we submitted 20 representative items from the human-labelled set to a standard web search engine. The goal was not comprehensive verification but to test whether a simpler method could resolve items that the LLM judge found ambiguous.

Table 2b. Web search verification of 20 sampled items (proof of concept). Search confirmed or improved upon 95% of human labels, with zero contradictions. All items were verifiable in under 10 seconds.
| Verification Outcome | Count | Percentage | Avg Time | Implication |
| --- | --- | --- | --- | --- |
| Confirms human label | 18 | 90% | < 5–10s | Search reliable |
| Resolves human AMBIGUOUS | 1 | 5% | < 10s | Search > human |
| Borderline (debated topic) | 1 | 5% | < 10s | Genuine ambiguity |
| Contradicts human label | 0 | 0% | - | - |

Web search confirmed 18 of 20 human labels, resolved one item humans marked ambiguous (the Gandhi misattribution, which search definitively traced to Arleen Lorrance, 1974), and identified one genuinely borderline case (effects of eating after 8pm, where scientific evidence is debated). No search result contradicted a human label. This proof-of-concept suggests that a query classification layer, which we term epistemic routing, could direct factual lookup questions to search engines rather than LLM judges, potentially eliminating this category of judge failure for verifiable factual claims.

5. Judge Dependence

To test whether the ambiguity finding is specific to GPT-4o, we re-evaluated all 2,985 committed responses using Claude-Sonnet-4 as an independent second judge.

Table 3. Judge ambiguity rates differ by judge model and subject model tier. The open-vs-frontier gap persists across both judges (21.6pp with GPT-4o, 15.9pp with Claude), confirming tier bias as a property of the evaluation setup.

| Model | GPT-4o Ambig | Claude Ambig | Agreement | GPT-4o gap | Claude gap |
| --- | --- | --- | --- | --- | --- |
| Claude-Sonnet | 12.5% | 5.6% | 87.4% | - | - |
| GPT-4o | 20.6% | 12.0% | 84.9% | | |
| Gemini-1.5-Pro | 21.9% | 15.2% | 85.0% | | |
| DeepSeek-V3 | 23.5% | 15.2% | 82.4% | | |
| Llama 3.3 70B | 34.0% | 21.5% | 78.7% | 21.6pp | 15.9pp |

Overall judge agreement was 83.7%. Claude produced fewer ambiguous verdicts (410 vs 668, a 39% reduction), resolving 77 of GPT-4o’s ambiguous responses as truthful and 239 as hallucinations. Under conservative scoring, switching judges shifts the hallucination rate from 8.9% to 15.4% (+6.5pp). The model-tier bias persists across both judges: Llama’s ambiguity rate exceeds Claude-Sonnet’s by 21.6pp under GPT-4o and 15.9pp under Claude, confirming this as a structural property rather than a single-judge artifact.

6. The Range Problem: Hallucination Rates Are Not Numbers

Table 4. Hallucination rates (%) under six evaluation conditions (2 judges × 3 regimes). Each model’s rate is a range, not a point estimate. Rankings change 5 times across 6 conditions. The range column shows the spread for each model.

| Model | GPT-4o Cons | GPT-4o Aggr | GPT-4o Excl | Claude Cons | Claude Aggr | Claude Excl | Range |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Claude-Son. | 10.1% | 22.5% | 11.5% | 15.5% | 21.0% | 16.4% | 12.5pp |
| GPT-4o | 7.2% | 27.8% | 9.0% | 12.7% | 24.7% | 14.5% | 20.6pp |
| Gemini-Pro | 8.2% | 30.1% | 10.5% | 12.5% | 27.6% | 14.7% | 21.9pp |
| DeepSeek | 8.9% | 32.4% | 11.6% | 16.2% | 31.4% | 19.1% | 23.5pp |
| Llama 70B | 9.8% | 43.8% | 14.8% | 19.5% | 41.0% | 24.9% | 34.0pp |

This table is the central result. It shows that every model’s hallucination rate is a range, not a number. GPT-4o’s rate spans 7.2%–27.8% (3.9×); Llama’s spans 9.8%–43.8% (4.5×). Under GPT-4o Conservative scoring, GPT-4o ranks #1; under Aggressive scoring, Claude-Sonnet ranks #1. Any benchmark reporting a single hallucination rate without disclosing the judge model and scoring regime is presenting a methodological artifact as a finding.

7. Discussion

7.1 The LLM-as-Judge Failure Mode

Our human evaluation reveals that the GPT-4o judge’s “ambiguous” category is not a genuine uncertainty signal but a systematic failure to classify hallucinations. With a 93% error rate on ambiguous verdicts (77% were hallucinations, 16% truthful, only 7% genuinely ambiguous), the judge’s primary failure mode is false negatives (missed hallucinations), not false positives. This has direct implications for any benchmark, leaderboard, or safety evaluation that relies on LLM-as-judge without human adjudication.

7.2 Implications for Benchmark Design

We recommend:

(1) Report scoring regime explicitly and present results under multiple regimes.
(2) Use dual-judge evaluation with at least two independent judge models.
(3) Include human adjudication on the ambiguous category; our results show this resolves 93% of cases.
(4) Report judge ambiguity rates per model, as these are tier-dependent and affect rankings.
(5) Separate coverage from safety: model abstention is a safety mechanism, not a failure mode.
(6) Consider epistemic routing: our web search verification indicates that factual lookup questions can be verified more reliably by search engines than by LLM judges, suggesting that benchmark evaluation pipelines should classify queries and route factual items to search-based verification.

7.3 Implications for Leaderboards

Current leaderboards that use LLM-as-judge evaluation may produce results that are sensitive to ambiguity handling in ways that disproportionately affect open-weight models. Under conservative scoring, Llama appears comparable to frontier models (9.8% vs 10.1% for Claude-Sonnet). Under aggressive scoring, Llama’s rate is 43.8% against Claude-Sonnet’s 22.5% (Table 4), roughly 1.9× higher; the 4.3× figure arises only when Llama’s aggressive rate is compared with Claude-Sonnet’s conservative 10.1%. Without reporting the scoring regime and judge model, leaderboard rankings are not reproducible and may reflect evaluation-induced bias rather than capability differences. We recommend leaderboards adopt the Exclude regime or report all three regimes, and disclose ambiguity rates per model.

7.4 The Truthfulness Definition Problem

During human labelling, annotators identified cases where the model’s answer was semantically correct but literally imprecise (e.g., Gandhi’s misattributed quote conveys the right meaning but wrong provenance). This reveals a deeper problem: benchmarks implicitly define “truthful” as literal factual accuracy, while users and even expert annotators often assess truthfulness as semantic accuracy. This definitional gap contributes to judge ambiguity and is itself an unreported methodological variable.

7.5 Limitations

We evaluate on TruthfulQA only; replication on additional benchmarks is needed. Our human evaluation uses consensus labelling by three annotators rather than independent labelling with inter-annotator agreement metrics, which limits the strength of the human ground-truth claim.
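For readers unfamiliar with the agreement metric named above, the following is a minimal sketch of Cohen’s κ for one pair of annotators, κ = (p_o − p_e) / (1 − p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The paper reports no independent labels, so the two label lists are invented purely for illustration, and the function name `cohens_kappa` is an assumption introduced here. With three annotators, κ would be computed per pair, or a multi-rater statistic such as Fleiss’ κ would be used instead.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items given identical labels.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_chance == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical labels for ten items (H = hallucination, T = truthful, A = ambiguous).
# These are NOT data from the paper.
annotator_1 = ["H", "H", "H", "T", "H", "A", "H", "T", "H", "H"]
annotator_2 = ["H", "H", "T", "T", "H", "H", "H", "T", "H", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.2f}")
```

In this toy example the annotators agree on 7 of 10 items, but because both label most items “H”, chance agreement is already 0.49, so κ is much lower than the raw 70% agreement. This is the reason consensus labelling without κ gives limited information about label reliability.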
The extrapolation from 100 sampled items to all 668 ambiguous responses assumes the stratified sample is representative. Future work should include full independent labelling with Cohen’s κ computation and replication on at least two additional benchmarks.

8. Conclusion

We have demonstrated that multi-model LLM benchmark results are critically sensitive to scoring methodology. The 3.5× range in hallucination rates from scoring regime choice, the judge-dependent ambiguity rates (22.4% vs 13.7%), the systematic bias against open models, and the human finding that 77% of ambiguous verdicts are missed hallucinations all point to fundamental gaps in current evaluation practice. The human-corrected ground-truth rate of 26.1% reveals that conservative scoring (the implicit default in most published work) underestimates hallucination by 3×. We propose explicit regime reporting, dual-judge evaluation, human adjudication of ambiguous verdicts, and structured epistemic output as necessary foundations for reproducible multi-model LLM assessment.

Declarations

Acknowledgements

We thank our third annotator for participating in the consensus labelling of 100 ambiguous responses.

References

Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630.
Kruse, M., Afshar, M., Khatwani, S. et al. (2025). Simple yet effective: An information-theoretic approach to multi-LLM uncertainty quantification. Proc. EMNLP 2025, 30481–30492.
Lin, S., Hilton, J. & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. Proc. ACL 2022, 3214–3252.
Manakul, P., Liusie, A. & Gales, M.J.F. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection. Proc. EMNLP 2023, 9004–9017.
Wang, X., Wei, J., Schuurmans, D. et al. (2023). Self-consistency improves chain of thought reasoning. Proc. ICLR 2023.
Zheng, L., Chiang, W.-L., Sheng, Y. et al. (2023).
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Proc. NeurIPS 2023 Datasets and Benchmarks. Additional Declarations The authors declare no competing interests. Supplementary Files truthfulqarescoremultijudge.json Multi Judge Scoring humanlabelscompleted1.json Human Label Cite Share Download PDF Status: Posted Version 1 posted You are reading this latest preprint version Research Square lets you share your work early, gain feedback from the community, and start making changes to your manuscript prior to peer review in a journal. As a division of Research Square Company, we’re committed to making research communication faster, fairer, and more useful. We do this by developing innovative software and high quality services for the global research community. Our growing team is made up of researchers and industry professionals working together to solve the most critical problems facing scientific publishing. Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-9240163","acceptedTermsAndConditions":true,"allowDirectSubmit":true,"archivedVersions":[],"articleType":"Research Article","associatedPublications":[],"authors":[{"id":613033088,"identity":"dbad7dd1-46c7-4a7b-8a98-37685a86c249","order_by":0,"name":"AZRIL BIN 
HAMZAH","email":"data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAAZAAAAAyAQMAAABI0h/eAAAABlBMVEX///8AAABVwtN+AAAACXBIWXMAAA7EAAAOxAGVKw4bAAAAmElEQVRIiWNgGAWjYFAC5gNAQgKIGRuI1cKWwMCQQJoWHgOgFlKcxd9/5uPjwh8W0Qzsh4m0ReLA2c3GMxIkcht4Eol12MHebdI8IC0MxGqRP8zz/DdYC/9DIrUYHONhYwZrkSDWFsMzbMbSM9IkctskiLVF7vzhh58LbOpy+/nTHxCnBQSYQQQb8ephWkbBKBgFo2AU4AQAOw4oxxN1qEIAAAAASUVORK5CYII=","orcid":"https://orcid.org/0009-0009-2760-6073","institution":"eptim.ai","correspondingAuthor":true,"prefix":"","firstName":"AZRIL","middleName":"BIN","lastName":"HAMZAH","suffix":""},{"id":613033286,"identity":"3a4dcb3d-8f4d-41c3-8cb9-f30bbcfa3772","order_by":1,"name":"SHASHA TENG","email":"","orcid":"https://orcid.org/0000-0001-6188-9470","institution":"eptim.ai","correspondingAuthor":false,"prefix":"","firstName":"SHASHA","middleName":"","lastName":"TENG","suffix":""}],"badges":[],"createdAt":"2026-03-27 05:07:54","currentVersionCode":1,"declarations":{"humanSubjects":false,"vertebrateSubjects":false,"conflictsOfInterestStatement":false,"humanSubjectEthicalGuidelines":false,"humanSubjectConsent":false,"humanSubjectClinicalTrial":false,"humanSubjectCaseReport":false,"vertebrateSubjectEthicalGuidelines":false},"doi":"10.21203/rs.3.rs-9240163/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-9240163/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":105752203,"identity":"f468f051-c3cb-4a09-832f-8d2b4dcf3127","added_by":"auto","created_at":"2026-03-30 15:55:49","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1125907,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-9240163/v1/34b5f6d3-447a-4c63-82fa-591ac5bfebda.pdf"},{"id":105728861,"identity":"c60823ec-703f-48f9-b0f4-d536019b5323","added_by":"auto","created_at":"2026-03-30 
11:12:54","extension":"json","order_by":1,"title":"","display":"","copyAsset":false,"role":"supplement","size":1191017,"visible":true,"origin":"","legend":"\u003cp\u003eMulti Judge Scoring\u003c/p\u003e","description":"","filename":"truthfulqarescoremultijudge.json","url":"https://assets-eu.researchsquare.com/files/rs-9240163/v1/0b10ae9e92119470fd066153.json"},{"id":105703163,"identity":"8da0d2b4-7979-4829-846c-7a9faab2a856","added_by":"auto","created_at":"2026-03-30 06:33:16","extension":"json","order_by":2,"title":"","display":"","copyAsset":false,"role":"supplement","size":125486,"visible":true,"origin":"","legend":"\u003cp\u003eHuman Label\u0026nbsp;\u003c/p\u003e","description":"","filename":"humanlabelscompleted1.json","url":"https://assets-eu.researchsquare.com/files/rs-9240163/v1/4793835232164f28db9ae479.json"}],"financialInterests":"The authors declare no competing interests.","formattedTitle":"\u003cp\u003e\u003cstrong\u003eThe Scoring Problem in Multi-Model LLM Benchmarks: How Unreported Methodological Choices Change Hallucination Measurement by 3.5×\u003c/strong\u003e\u003c/p\u003e","fulltext":[{"header":"1. Introduction","content":"\u003cp\u003eThe evaluation of large language models in multi-model settings has become central to AI safety research. Methods such as self-consistency voting (Wang et al., \u003cspan citationid=\"CR5\" class=\"CitationRef\"\u003e2023\u003c/span\u003e), multi-LLM uncertainty quantification (Kruse et al., \u003cspan citationid=\"CR2\" class=\"CitationRef\"\u003e2025\u003c/span\u003e), and semantic entropy (Farquhar et al., \u003cspan citationid=\"CR1\" class=\"CitationRef\"\u003e2024\u003c/span\u003e) all rely on comparing outputs across models to detect hallucination. 
The dominant evaluation paradigm\u0026mdash;LLM-as-judge (Zheng et al., \u003cspan citationid=\"CR6\" class=\"CitationRef\"\u003e2023\u003c/span\u003e)\u0026mdash;uses a frontier model to classify responses as truthful or hallucinated.\u003c/p\u003e \u003cp\u003eWe identify a critical but unacknowledged problem: these benchmarks produce dramatically different conclusions depending on three unreported methodological choices: (1) how \u003cem\u003eepistemic abstentions\u003c/em\u003e are scored, (2) how \u003cem\u003epolicy refusals\u003c/em\u003e are scored, and (3) how \u003cem\u003ejudge-ambiguous responses\u003c/em\u003e are scored. On identical data, these choices shift hallucination rates by 3.5\u0026times; and change model rankings 5 times across 6 evaluation conditions.\u003c/p\u003e \u003cp\u003eWe provide three layers of evidence: (1) a three-regime scoring analysis showing the 3.5\u0026times; instability, (2) a dual-judge comparison (GPT-4o and Claude) showing judge-dependent ambiguity, and (3) \u003cb\u003ehuman evaluation by three annotators\u003c/b\u003e of 100 stratified ambiguous responses, establishing that 77% of judge-ambiguous verdicts are missed hallucinations. The human-corrected ground-truth rate (26.1%) is three times the conservative rate (8.9%) that a benchmark would typically report.\u003c/p\u003e"},{"header":"2. Experimental Setup","content":"\u003cp\u003eWe evaluate on TruthfulQA (Lin et al., \u003cspan citationid=\"CR3\" class=\"CitationRef\"\u003e2022\u003c/span\u003e), comprising 817 adversarially designed questions. Five models span three capability tiers: Claude-Sonnet-4 and GPT-4o (frontier), Gemini-1.5-Pro (frontier), DeepSeek-V3 (commercial mid-tier), and Llama 3.3 70B (open-weight, via OpenRouter). Models receive a structured prompt requiring explicit epistemic state declaration (ANSWER, ABSTAIN, or REFUSE) before responding. Evaluation uses GPT-4o as primary judge and Claude-Sonnet-4 as secondary judge. 
Human evaluation was performed by three annotators using consensus labelling with TruthfulQA reference answers as ground truth.\u003c/p\u003e"},{"header":"3. The Scoring Problem","content":"\u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003e3.1 Response Distribution\u003c/h2\u003e \u003cp\u003eOf 3,950 total responses: 2,985 committed answers (75.6%), 834 abstentions (21.1%), 121 policy refusals (3.1%). Of the 2,985 committed answers, GPT-4o judged 265 as hallucinated (8.9%), 2,052 as truthful (68.7%), and \u003cb\u003e668 as ambiguous (22.4%)\u003c/b\u003e. The ambiguous category\u0026mdash;more than double the hallucination count\u0026mdash;has no standard treatment in the literature.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec5\" class=\"Section2\"\u003e \u003ch2\u003e3.2 Three Scoring Regimes\u003c/h2\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab1\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 1\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eFour scoring approaches produce different benchmark results on identical data. 
The human-corrected rate (26.1%) establishes the ground truth, revealing that conservative scoring underestimates hallucination by 3\u0026times;.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eRegime\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eHal Rate\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eσ AUC\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eρ\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eTreatment of ambiguous\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eConservative\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e8.9%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e55.1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eNot hallucination\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eAggressive\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" 
char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e31.3%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e60.1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1.24\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eHallucination\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eExclude\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e11.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e55.1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0.50\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eRemoved from denominator\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eHuman-corrected\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e26.1%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eLabelled by 3 annotators\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003c/div\u003e"},{"header":"4. Human Evaluation: Ground Truth","content":"\u003cp\u003eTo resolve the ambiguity question empirically, three annotators performed consensus labelling on 100 stratified ambiguous responses (20 per model, selected from the 668 GPT-4o-ambiguous items). 
Annotators were provided with TruthfulQA reference answers (correct and incorrect answer lists) and reviewed each item collaboratively before reaching consensus. Mean annotator confidence was 0.80 on a 0\u0026ndash;1 scale. We note that consensus labelling may overestimate agreement compared to independent annotation; future work will include fully independent labelling with inter-annotator reliability (κ) reporting.\u003c/p\u003e \u003cdiv id=\"Sec7\" class=\"Section2\"\u003e \u003ch2\u003e4.1 Results\u003c/h2\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab2\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eHuman evaluation of 100 GPT-4o-ambiguous responses. Three consensus annotators resolved 93% of items the judge could not classify. 77% were hallucinations the judge missed.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eTruthful\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eHallucination\u003c/p\u003e \u003c/th\u003e 
\u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAmbiguous\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eTotal\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eJudge Error\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003e\u003cb\u003eAll 100 items\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e16 (16%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e77 (77%)\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e7 (7%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e100\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u003cb\u003e93%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude-Sonnet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e7 (35%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e13 (65%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e0 (0%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e100%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e1 (5%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e17 (85%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2 (10%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e 
\u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e90%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini-1.5-Pro\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3 (15%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e15 (75%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2 (10%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e90%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek-V3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e3 (15%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e16 (80%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e1 (5%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e95%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLlama 3.3 70B\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c2\"\u003e \u003cp\u003e2 (10%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e16 (80%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e2 (10%)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e20\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e90%\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e 
\u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003e \u003cb\u003eThe GPT-4o judge\u0026rsquo;s \u0026ldquo;ambiguous\u0026rdquo; category predominantly reflects missed hallucinations rather than genuine uncertainty.\u003c/b\u003e Of 100 ambiguous verdicts, 77 were hallucinations and 16 were truthful\u0026mdash;only 7 were genuinely ambiguous to human annotators. If this stratified sample is representative of all 668 ambiguous responses, this suggests approximately 514 hidden hallucinations, yielding an estimated corrected benchmark rate of 26.1% (vs. 8.9% under conservative scoring).\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec8\" class=\"Section2\"\u003e \u003ch2\u003e4.2 Ambiguity Taxonomy\u003c/h2\u003e \u003cp\u003eDuring labelling, annotators identified five distinct types of ambiguity that caused judge failure:\u003c/p\u003e \u003cp\u003e \u003cb\u003eType 1: Semantic vs. literal accuracy.\u003c/b\u003e The model\u0026rsquo;s answer conveys the correct meaning but uses imprecise or misattributed wording (e.g., Gandhi\u0026rsquo;s \u0026ldquo;Be the change\u0026rdquo; paraphrase). The judge cannot determine whether semantic equivalence counts as truthful.\u003c/p\u003e \u003cp\u003e \u003cb\u003eType 2: Data dependency.\u003c/b\u003e Answer correctness depends on which data source, year, or methodology is used (e.g., \u0026ldquo;Americans work less than people in which countries?\u0026rdquo;). 
Multiple defensible answers exist.\u003c/p\u003e \u003cp\u003e \u003cb\u003eType 3: Scope confusion.\u003c/b\u003e The model\u0026rsquo;s answer is correct in a broader context but wrong for the specific question asked (e.g., flag desecration in the UK\u0026mdash;public order offences exist but no flag-specific law).\u003c/p\u003e \u003cp\u003e \u003cb\u003eType 4: Contested empirical claims.\u003c/b\u003e The benchmark treats a question as having one correct answer, but the evidence is genuinely debated (e.g., \u0026ldquo;Is kindergarten crucial for development?\u0026rdquo;).\u003c/p\u003e \u003cp\u003e \u003cb\u003eType 5: Judge paraphrase failure.\u003c/b\u003e The model\u0026rsquo;s answer is semantically correct but the judge cannot match it to reference answers due to different phrasing (e.g., \u0026ldquo;Country not specified\u0026rdquo; vs. \u0026ldquo;It depends which country\u0026rdquo;). This was the most common type.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec9\" class=\"Section2\"\u003e \u003ch2\u003e4.3 Web Search Verification (Proof of Concept)\u003c/h2\u003e \u003cp\u003eAs a proof-of-concept demonstration, we submitted 20 representative items from the human-labelled set to a standard web search engine. The goal was not comprehensive verification but to test whether a simpler method could resolve items that the LLM judge found ambiguous.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab3\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 2b\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eWeb search verification of 20 sampled items (proof of concept). Search confirmed or improved upon 95% of human labels, with zero contradictions. 
All items were verifiable in under 10 seconds.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"5\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eVerification Outcome\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eCount\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003ePercentage\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAvg Time\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eImplication\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eConfirms human label\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e18\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e90%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;5\u0026ndash;10s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSearch reliable\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eResolves human AMBIGUOUS\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd 
align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;5s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eSearch\u0026thinsp;\u0026gt;\u0026thinsp;human\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eBorderline (debated topic)\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e1\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026lt;\u0026thinsp;10s\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003eGenuine ambiguity\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eContradicts human label\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e0\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c3\"\u003e \u003cp\u003e0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c4\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eWeb search confirmed 18 of 20 human labels, resolved one item humans marked ambiguous (the Gandhi misattribution, which search definitively traced to Arleen Lorrance, 1974), and identified one genuinely borderline case (effects of eating after 8pm, where scientific evidence is debated). 
\u003cb\u003eNo search result contradicted a human label.\u003c/b\u003e This proof-of-concept suggests that a query classification layer\u0026mdash;what we term \u003cem\u003eepistemic routing\u003c/em\u003e\u0026mdash;could direct factual lookup questions to search engines rather than LLM judges, potentially eliminating this category of judge failure for verifiable factual claims.\u003c/p\u003e \u003c/div\u003e"},{"header":"5. Judge Dependence","content":"\u003cp\u003eTo test whether the ambiguity finding is specific to GPT-4o, we re-evaluated all 2,985 committed responses using Claude-Sonnet-4 as an independent second judge.\u003c/p\u003e \u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab4\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 3\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eJudge ambiguity rates differ by judge model and subject model tier. The open-vs-frontier gap persists across both judges (21.6pp with GPT-4o, 15.9pp with Claude), confirming tier bias as a property of the evaluation setup.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"6\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-4o 
Ambig\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eClaude Ambig\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eAgreement\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eGPT-4o gap\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eClaude gap\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude-Sonnet\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e12.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e5.6%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e87.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e\u0026mdash;\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e20.6%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e12.0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e84.9%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini-1.5-Pro\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e21.9%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e 
\u003cp\u003e15.2%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e85.0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek-V3\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e23.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e15.2%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e82.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e\u0026nbsp;\u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e\u0026nbsp;\u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLlama 3.3 70B\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e\u003cb\u003e34.0%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e21.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e78.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c5\"\u003e \u003cp\u003e21.6pp\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c6\"\u003e \u003cp\u003e15.9pp\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eOverall judge agreement was 83.7%. Claude produced fewer ambiguous verdicts (410 vs 668, a 39% reduction), resolving 77 of GPT-4o\u0026rsquo;s ambiguous responses as truthful and 239 as hallucinations. Under conservative scoring, switching judges shifts the hallucination rate from 8.9% to 15.4% (+\u0026thinsp;6.5pp). 
\u003cb\u003eThe model-tier bias persists across both judges\u003c/b\u003e: Llama\u0026rsquo;s ambiguity rate exceeds Claude-Sonnet\u0026rsquo;s by 21.6pp under GPT-4o and 15.9pp under Claude, confirming this as a structural property rather than a single-judge artifact.\u003c/p\u003e"},{"header":"6. The Range Problem: Hallucination Rates Are Not Numbers","content":"\u003cp\u003e \u003cdiv class=\"gridtable\"\u003e\u003ctable float=\"Yes\" id=\"Tab5\" border=\"1\"\u003e \u003ccaption language=\"En\"\u003e \u003cdiv class=\"CaptionNumber\"\u003eTable 4\u003c/div\u003e \u003cdiv class=\"CaptionContent\"\u003e \u003cp\u003eHallucination rates (%) under six evaluation conditions (2 judges \u0026times; 3 regimes). Each model\u0026rsquo;s rate is a range, not a point estimate. Rankings change 5 times across 6 conditions. The range column shows the spread for each model.\u003c/p\u003e \u003c/div\u003e \u003c/caption\u003e \u003ccolgroup cols=\"8\"\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c1\" colnum=\"1\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c2\" colnum=\"2\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c3\" colnum=\"3\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c4\" colnum=\"4\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c5\" colnum=\"5\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c6\" colnum=\"6\"\u003e\u003c/div\u003e \u003cdiv align=\"char\" char=\".\" class=\"colspec\" colname=\"c7\" colnum=\"7\"\u003e\u003c/div\u003e \u003cdiv align=\"left\" class=\"colspec\" colname=\"c8\" colnum=\"8\"\u003e\u003c/div\u003e \u003cthead\u003e \u003ctr\u003e \u003cth align=\"left\" colname=\"c1\"\u003e \u003cp\u003eModel\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c2\"\u003e \u003cp\u003eGPT-4o 
Cons\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c3\"\u003e \u003cp\u003eGPT-4o Aggr\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c4\"\u003e \u003cp\u003eGPT-4o Excl\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c5\"\u003e \u003cp\u003eClaude Cons\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c6\"\u003e \u003cp\u003eClaude Aggr\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c7\"\u003e \u003cp\u003eClaude Excl\u003c/p\u003e \u003c/th\u003e \u003cth align=\"left\" colname=\"c8\"\u003e \u003cp\u003eRange\u003c/p\u003e \u003c/th\u003e \u003c/tr\u003e \u003c/thead\u003e \u003ctbody\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eClaude-Son.\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e10.1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e22.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e11.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e15.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e21.0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e16.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e12.5pp\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGPT-4o\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e7.2%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e27.8%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e9.0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e12.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e 
\u003cp\u003e24.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e14.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e20.6pp\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eGemini-Pro\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e8.2%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e30.1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e10.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e12.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e27.6%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e14.7%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e21.9pp\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eDeepSeek\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e8.9%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e32.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e11.6%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e16.2%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e31.4%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e19.1%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e 
\u003cp\u003e\u003cb\u003e23.5pp\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003ctr\u003e \u003ctd align=\"left\" colname=\"c1\"\u003e \u003cp\u003eLlama 70B\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c2\"\u003e \u003cp\u003e9.8%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c3\"\u003e \u003cp\u003e\u003cb\u003e43.8%\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c4\"\u003e \u003cp\u003e14.8%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c5\"\u003e \u003cp\u003e19.5%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c6\"\u003e \u003cp\u003e41.0%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"char\" char=\".\" colname=\"c7\"\u003e \u003cp\u003e24.9%\u003c/p\u003e \u003c/td\u003e \u003ctd align=\"left\" colname=\"c8\"\u003e \u003cp\u003e\u003cb\u003e34.0pp\u003c/b\u003e\u003c/p\u003e \u003c/td\u003e \u003c/tr\u003e \u003c/tbody\u003e \u003c/colgroup\u003e \u003c/table\u003e\u003c/div\u003e \u003c/p\u003e \u003cp\u003eThis table is the central result. It shows that \u003cb\u003eevery model\u0026rsquo;s hallucination rate is a range, not a number\u003c/b\u003e. GPT-4o\u0026rsquo;s rate spans 7.2%\u0026ndash;27.8% (3.9\u0026times;); Llama\u0026rsquo;s spans 9.8%\u0026ndash;43.8% (4.5\u0026times;). Under GPT-4o Conservative scoring, GPT-4o ranks #1; under Aggressive scoring, Claude-Sonnet ranks #1. Any benchmark reporting a single hallucination rate without disclosing the judge model and scoring regime is presenting a methodological artifact as a finding.\u003c/p\u003e"},{"header":"7. 
Discussion","content":"\u003cdiv id=\"Sec13\" class=\"Section2\"\u003e \u003ch2\u003e7.1 The LLM-as-Judge Failure Mode\u003c/h2\u003e \u003cp\u003eOur human evaluation reveals that the GPT-4o judge\u0026rsquo;s \u0026ldquo;ambiguous\u0026rdquo; category is not a genuine uncertainty signal but a systematic failure to classify hallucinations. With a 93% error rate on ambiguous verdicts (77% were hallucinations, 16% truthful, only 7% genuinely ambiguous), the judge\u0026rsquo;s primary failure mode is false negatives\u0026mdash;missing hallucinations\u0026mdash;not false positives. This has direct implications for any benchmark, leaderboard, or safety evaluation that relies on LLM-as-judge without human adjudication.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec14\" class=\"Section2\"\u003e \u003ch2\u003e7.2 Implications for Benchmark Design\u003c/h2\u003e \u003cp\u003eWe recommend: (1) \u003cb\u003eReport scoring regime explicitly\u003c/b\u003e and present results under multiple regimes. (2) \u003cb\u003eUse dual-judge evaluation\u003c/b\u003e with at least two independent judge models. (3) \u003cb\u003eInclude human adjudication\u003c/b\u003e on the ambiguous category\u0026mdash;our results show this resolves 93% of cases. (4) \u003cb\u003eReport judge ambiguity rates\u003c/b\u003e per model, as these are tier-dependent and affect rankings. (5) \u003cb\u003eSeparate coverage from safety\u003c/b\u003e: model abstention is a safety mechanism, not a failure mode. 
(6) \u003cb\u003eConsider epistemic routing\u003c/b\u003e: our proof-of-concept web search verification (20 items) suggests that factual lookup questions may be verified more reliably by search engines than by LLM judges, so benchmark evaluation pipelines could classify queries and route factual items to search-based verification.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec15\" class=\"Section2\"\u003e \u003ch2\u003e7.3 Implications for Leaderboards\u003c/h2\u003e \u003cp\u003eCurrent leaderboards that use LLM-as-judge evaluation may produce results that are sensitive to ambiguity handling in ways that disproportionately affect open-weight models. Under GPT-4o conservative scoring, Llama appears comparable to Claude-Sonnet (9.8% vs 10.1%). Under GPT-4o aggressive scoring, it appears about 1.9\u0026times; worse (43.8% vs 22.5%). \u003cb\u003eWithout reporting the scoring regime and judge model, leaderboard rankings are not reproducible and may reflect evaluation-induced bias rather than capability differences.\u003c/b\u003e We recommend leaderboards adopt the Exclude regime or report all three regimes, and disclose ambiguity rates per model.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec16\" class=\"Section2\"\u003e \u003ch2\u003e7.4 The Truthfulness Definition Problem\u003c/h2\u003e \u003cp\u003eDuring human labelling, annotators identified cases where the model\u0026rsquo;s answer was semantically correct but literally imprecise (e.g., Gandhi\u0026rsquo;s misattributed quote conveys the right meaning but wrong provenance). This reveals a deeper problem: benchmarks implicitly define \u0026ldquo;truthful\u0026rdquo; as literal factual accuracy, while users and even expert annotators often assess truthfulness as semantic accuracy. 
This definitional gap contributes to judge ambiguity and is itself an unreported methodological variable.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec17\" class=\"Section2\"\u003e \u003ch2\u003e7.5 Limitations\u003c/h2\u003e \u003cp\u003eWe evaluate on TruthfulQA only; replication on additional benchmarks is needed. Our human evaluation uses consensus labelling by three annotators rather than independent labelling with inter-annotator agreement metrics, which limits the strength of the human ground-truth claim. The extrapolation from 100 sampled items to all 668 ambiguous responses assumes the stratified sample is representative. Future work should include full independent labelling with an inter-annotator agreement statistic (e.g., Fleiss\u0026rsquo; κ, since Cohen\u0026rsquo;s κ is defined for two raters) and replication on at least two additional benchmarks.\u003c/p\u003e \u003c/div\u003e"},{"header":"8. Conclusion","content":"\u003cp\u003eWe have demonstrated that multi-model LLM benchmark results are critically sensitive to scoring methodology. The 3.5\u0026times; range in hallucination rates from scoring regime choice, the judge-dependent ambiguity rates (22.4% vs 13.7%), the systematic bias against open models, and the human finding that 77% of sampled ambiguous verdicts are missed hallucinations all point to fundamental gaps in current evaluation practice. The estimated human-corrected rate of 26.1% suggests that conservative scoring\u0026mdash;the implicit default in most published work\u0026mdash;underestimates hallucination by roughly 3\u0026times; (8.9% vs 26.1%). 
We propose explicit regime reporting, dual-judge evaluation, human adjudication of ambiguous verdicts, and structured epistemic output as necessary foundations for reproducible multi-model LLM assessment.

Declarations

Acknowledgements

We thank our third annotator for participating in the consensus labelling of 100 ambiguous responses.

References

1. Farquhar, S., Kossen, J., Kuhn, L. & Gal, Y. (2024). Detecting hallucinations in large language models using semantic entropy. Nature, 630, 625–630.
2. Kruse, M., Afshar, M., Khatwani, S. et al. (2025). Simple yet effective: An information-theoretic approach to multi-LLM uncertainty quantification. Proc. EMNLP 2025, 30481–30492.
3. Lin, S., Hilton, J. & Evans, O. (2022). TruthfulQA: Measuring how models mimic human falsehoods. Proc. ACL 2022, 3214–3252.
4. Manakul, P., Liusie, A. & Gales, M.J.F. (2023). SelfCheckGPT: Zero-resource black-box hallucination detection. Proc. EMNLP 2023, 9004–9017.
5. Wang, X., Wei, J., Schuurmans, D. et al. (2023). Self-consistency improves chain of thought reasoning. Proc. ICLR 2023.
6. Zheng, L., Chiang, W.-L., Sheng, Y. et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Proc. NeurIPS 2023 Datasets and Benchmarks.

Keywords: LLM evaluation, hallucination detection, benchmark methodology, LLM-as-judge, TruthfulQA

Posted: March 30th, 2026

License: CC-BY-4.0