Across Generations, Sizes, and Types, Large Language Models Poorly Report Self-Confidence in Gastroenterology Clinical Reasoning Tasks

doi:10.21203/rs.3.rs-6725427/v1

2025 · doi:10.21203/rs.3.rs-6725427/v1

preprint OA: closed CC-BY-4.0

Short Report

Nariman Naderi, Seyed Amir Ahmad Safavi-Naini, Thomas Savage, Mohammad Amin Khalafi, Peter Lewis, Zahra Atf, Girish Nadkarni, and Ali Soroush

This is a preprint; it has not been peer reviewed by a journal. https://doi.org/10.21203/rs.3.rs-6725427/v1

This work is licensed under a CC BY 4.0 License. Status: Under Review. Version 1.

Abstract

This study evaluated confidence calibration across 48 large language models (LLMs) using 300 gastroenterology board exam-style questions. Regardless of response accuracy, all models demonstrated poor certainty estimation. Even the best-calibrated systems (o1 preview, GPT-4o, Claude-3.5-Sonnet) showed substantial overconfidence (Brier scores 0.15-0.2, AUROC ~0.6).
Most concerning, models maintained high certainty regardless of question difficulty or their actual knowledge limitations. This metacognitive deficiency poses significant challenges for safe clinical implementation of current LLMs in gastroenterology.

Subject areas: Biological sciences/Computational biology and bioinformatics; Health sciences/Gastroenterology

Keywords: Large Language Models, Metacognition, Artificial Intelligence, Gastroenterology, Uncertainty Quantification

Introduction

Large Language Models (LLMs) are rapidly transforming healthcare, but their tendency to present incorrect information with convincing terminology, termed “hallucinations”, poses substantial safety concerns when used in clinical settings [1,2]. In high-stakes medical environments, such misinformation can lead to misdiagnosis, inappropriate treatment selection, or failure to identify critical patient conditions [3,4]. Communicating uncertainty is therefore essential for providing clinicians with reliable indicators of when model outputs should be treated with caution.

Various approaches have been developed to quantify LLM uncertainty, ranging from intrinsic methods analyzing model internals (token probabilities, attention patterns) to extrinsic techniques (ensemble disagreement, calibration layers, surrogate models) [5–7]. However, many of these methods require substantial computational resources or specialized expertise to implement and interpret.

Among these uncertainty quantification techniques, self-reported confidence through natural language offers unique practical advantages for clinical implementation. Unlike complex technical approaches that require specialized expertise to interpret, self-reported confidence provides immediately interpretable outputs in the same natural language format as the provided clinical information [8–11].
This approach requires minimal technical overhead and is intuitive to understand, making it particularly appealing for real-world healthcare applications where simplicity and interpretability are paramount.

However, the effectiveness of LLM self-reported confidence depends critically on language-based metacognition, that is, the ability to accurately monitor and evaluate one's own knowledge boundaries and reasoning processes. Prior research suggests that LLM self-reported confidence assessments are poorly calibrated, demonstrating a lack of meaningful metacognitive capabilities [12,13]. However, prior studies of this confidence-accuracy gap have primarily examined general clinical reasoning tasks and a limited subset of model deployment conditions. Our systematic analysis across model architectures, parameter scales, and training methodologies examines whether confidence miscalibration represents a universal limitation or varies meaningfully across model families. These insights can inform the development of more trustworthy clinical AI systems and identify whether certain models and environments inherently promote better self-reported confidence.

We evaluated self-reported confidence for 48 commercial and open-source LLMs across local, web, and API-based environments using the 2022 American College of Gastroenterology self-assessment examination, which contains 300 board exam-style multiple-choice questions. Gastroenterology was selected as our test domain primarily because the senior author (AS) is a practicing gastroenterologist, providing direct clinical insight into the impact of LLM confidence miscalibration. Subspecialty domains like gastroenterology also present unique challenges for clinical reasoning that make them ideal test environments: they require integration of diverse knowledge sources to formulate diagnoses and involve procedures with significant risks, where diagnostic or treatment errors can lead to serious patient harm.
We used standardized board exam-style questions because they offer an objective benchmark for evaluating model performance across a range of clinically relevant scenarios.

We employed a systematic approach in which models were instructed to select the correct answer choice for each board exam question and to explicitly report their confidence on a 0-10 scale (from least to most confident). Building on our established methodology [14], we optimized model parameters including prompt instructions, temperature settings, and token limit to maximize response accuracy. A semi-automated extraction pipeline with human verification (99% accuracy, Supplementary Figure S1) was used to process the responses and confidence scores for subsequent analysis.

We extracted 13,362 answers and 12,307 confidence scores (Figure 1). The difference between these counts (1,055) resulted primarily from non-compliance with prompt instructions (n=846) or from reasoning models that exhausted their token limits because of their internal reasoning dialogues (n=209) (Supplementary Figure S2). Mean confidence scores ranged from 7.99 (95% CI: 7.89-8.09) for Claude-3-Opus to 9.58 (95% CI: 9.45-9.71) for Mistral-7b, while accuracy varied substantially from 30.3% (Llama3-8b-Q8) to 81.5% (o1 preview) (Table 1).

All models demonstrated systematic overconfidence, with average confidence consistently exceeding average accuracy (Figure 2). We also observed a substantial overlap in confidence distributions between correct and incorrect responses, indicating limited discriminative capacity (Figure 3). In other words, models expressed high certainty regardless of whether their answers were right or wrong, which is a critical safety issue in clinical settings. We quantified this observation through discrimination metrics.
Even the best-performing model (o1 mini) achieved an Area Under the Receiver Operating Characteristic curve (AUROC) of only 0.626, well below the 0.7 threshold typically considered meaningful for clinical applications (Table 1; Supplementary Figure S3). This pattern was consistent across all model families.

Calibration measurements reinforced these findings. Only 5 of the 48 models demonstrated better-than-random performance. o1-preview showed the best calibration (Brier score: 0.157), followed by Claude-3.5-Sonnet (0.202) and GPT-4o (0.206) (Supplementary Figure S4). The calibration curves in Figure 4 and Supplementary Figure S5 visually confirm this systematic overconfidence trend. Expected Calibration Error (ECE) measurements pointed the same way, with even the best models (o1-preview: 0.100, Claude-3.5-Sonnet: 0.122) showing meaningful deviations from ideal calibration (Supplementary Figure S6).

Perhaps most alarming for clinical applications, we found that models maintained high confidence levels even as their accuracy significantly decreased on more challenging questions (Figure 5). This relationship was universally observed, as even the best-calibrated models (Figure 5a-c) showed the same overconfidence on difficult questions as poorly calibrated ones (Figure 5d-f). We also investigated whether question length affected confidence assessments, finding that confidence scores remained remarkably stable regardless of text complexity and had no meaningful relationship with actual performance (Supplementary Figure S7). This suggests models lack the awareness to recognize that longer, potentially more complex questions could reduce their response certainty.

Looking at differences between model families, we observed generational improvements in self-assessed confidence performance. Newer versions consistently outperformed their predecessors. For example, o1 showed better calibration than GPT-4o, which in turn outperformed GPT-4 (Table 1).
Commercial models generally demonstrated superior uncertainty estimation compared to equivalent open-source alternatives, though this pattern had notable exceptions. We also found that quantization, while enabling deployment on less powerful hardware, typically degraded calibration quality (as seen when comparing Llama 3 8B with its quantized counterpart). Additional analysis of middle-performing models further confirmed these trends (Supplementary Figure S5).

Our findings confirm previous research highlighting the limitations of LLM self-reported confidence [15,16]. We extend those findings with three additional contributions. First, we present the most comprehensive cross-architectural evaluation to date, testing 48 LLMs (from 7B to 175B parameters) across commercial, open-source, and quantized deployments. Second, by using gastroenterology board-style questions, we deliver key domain-specific insights. Third, we show quantitatively that all models suffer a common metacognitive deficiency, in which even the best-calibrated LLMs remain systematically overconfident, regardless of question difficulty. This pervasive overconfidence transcends architecture, scale, and deployment environment, pointing to a fundamental limitation of current neural language models.

While we observed that newer model generations (for example, o1 and Claude 3.5) achieve modestly better calibration metrics [15], this improved calibration tracks closely with higher overall accuracy (Table 1). It remains unclear whether these gains reflect genuine uncertainty awareness or are simply byproducts of stronger performance [17]. In contrast, across all models, we observed high, unvarying self-reported confidence scores, irrespective of question difficulty (Figure 5), model generation (Figure 2), or correctness (Figure 3). This core observation suggests that confidence outputs are no more than statistically likely text, not true reflections of internal uncertainty.
In other words, LLMs are reciting the most probable “confidence score” token, rather than expressing insight into their own knowledge boundaries. Future efforts should explore architectural innovations or training objectives that explicitly foster genuine metacognitive capabilities (such as self-explanation or introspective feedback loops) rather than relying on incremental prompt or calibration tweaks.

Several limitations temper our conclusions. While our use of multiple-choice gastroenterology board exam-style questions offers a clear, objective benchmark, this approach may not generalize to open-ended clinical reasoning or to other medical specialties. Our standardized prompt engineering approach, which was designed to maximize accuracy, could itself bias models toward overconfident “expert” language. Finally, the ACG self-assessment questions are proprietary and only accessible to paying subscribers or via direct request for the structured data. We cannot rule out that LLMs may have seen these questions or similar content during their pretraining.

Despite these caveats, our findings highlight critical AI safety gaps: LLMs uniformly overestimate their certainty, poorly discriminate correct from incorrect answers, and fail to adjust confidence for harder questions. In high-stakes clinical settings, reliable expressions of decision certainty are essential for safe human–AI collaboration. While newer LLM generations offer incremental calibration improvements, none approach the reliability required for autonomous decision support. Addressing this confidence-accuracy gap will be vital to protect patient safety and foster appropriate clinician trust as we integrate LLMs into healthcare workflows.

Methods

Reference dataset

The 2022 American College of Gastroenterology (ACG) self-assessment consists of 300 questions, of which 138 contain images.
These questions were developed by a committee of gastroenterologists to reflect the knowledge, skills, and attitudes required for excellent patient care, covering a broad range of topics, including liver, colon, esophagus, pancreaticobiliary, and endoscopy. The questions were designed to assess higher-order thinking skills and were primarily case based. They were validated through statistical analysis of test-takers' performance, with an average correctness rate of 74.52% ± 19.49% on the 2022 assessment, indicating a moderate level of difficulty. Only the text portions of the questions and answers were used in this study’s analyses. Questions were categorized by length (token count), difficulty (percentage of correct answers by test-takers), and patient care phase (treatment, diagnosis, or investigation). Additional details are provided in Supplementary Section 1.

Response Generation and Confidence Score Elicitation

For response generation and confidence score elicitation, we built upon our established methodology [12], using 60 questions from the 2023 self-assessment exam and GPT-3.5 to select the model settings (temperature, maximum input, and output token count), prompt structure, and output format for all models. The configuration that maximized response accuracy was a temperature of 1, a maximum token count equal to the input token count plus 512 output tokens, a structured output approach, and the prompt (Fig. 1). Among the various prompt engineering techniques evaluated, the following were identified as having a positive impact on the outcomes: expert mimicry, contextual embedding, Answer and Justify, Chain of Thought, confidence scoring, and direct questioning. OpenAI Web interface, OpenAI API, Claude Web interface, Claude API, Gemini Web interface, Poe Web interface, Firework API, and locally hosted hardware configurations such as RTX4090Ti and H100 systems were used for response generation and confidence score elicitation.
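The reported configuration (a temperature of 1, an output allowance of 512 tokens on top of the input length, and a prompt that asks for an answer, a justification, and a 0-10 confidence score) can be illustrated with a short Python sketch. The prompt wording, the `build_request` helper, the `RequestConfig` container, and the whitespace-based `count_tokens` stand-in are all assumptions made for illustration; they are not the authors' prompt or tokenizer.

```python
from dataclasses import dataclass

OUTPUT_TOKEN_BUDGET = 512  # output allowance reported in the text
TEMPERATURE = 1.0          # temperature reported in the text

# Hypothetical template that combines the listed techniques (expert mimicry,
# chain of thought, answer-and-justify, confidence scoring). It is NOT the
# authors' actual prompt.
PROMPT_TEMPLATE = (
    "You are an expert gastroenterologist.\n"
    "Question: {question}\n"
    "Options:\n{options}\n"
    "Think step by step, state the single best answer, justify it, and "
    "report your confidence on a scale from 0 (least confident) to 10 "
    "(most confident)."
)


def count_tokens(text: str) -> int:
    """Crude whitespace count; a real model tokenizer would give other numbers."""
    return len(text.split())


@dataclass
class RequestConfig:
    prompt: str
    temperature: float
    max_tokens: int


def build_request(question: str, options: dict[str, str]) -> RequestConfig:
    """Assemble the prompt and the token limit (input tokens + 512)."""
    option_lines = "\n".join(f"{key}. {text}" for key, text in options.items())
    prompt = PROMPT_TEMPLATE.format(question=question, options=option_lines)
    max_tokens = count_tokens(prompt) + OUTPUT_TOKEN_BUDGET
    return RequestConfig(prompt=prompt, temperature=TEMPERATURE, max_tokens=max_tokens)
```

The resulting `RequestConfig` would then be passed to whichever client (web, API, or local) serves the model; that call is deliberately left out because it differs across the environments listed above.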
Output Parsing

To efficiently extract response and confidence data from the LLM outputs, we developed a structured output pipeline using GPT-4o (Fig. 1). Our hybrid methodology combined regex-based rules, which reduced the number of input tokens, with LLM-based extraction, which parsed the key portions of the LLM outputs. Specifically, the pipeline identified sentences containing "confid" for further LLM-based parsing to either extract the certainty score (0–10) or define the score as "not_mentioned". Sentences classified as "not_mentioned" in the first pass were passed through the LLM-based parsing step a second time to maximize extraction performance. The complete output parsing methodology is described in Supplementary Section 1.

To validate the output parsing pipeline, we compared it against manually extracted confidence scores from five randomly selected questions per model, achieving 98.8% accuracy (Supplementary Figure S1). Because some models did not reliably generate confidence scores, we excluded those that were missing confidence scores for more than 50% of the questions (Medicine-Chat Q8, OpenBioLLM-7B Q8, Qwen Qwq-32b, and GPT-3.5 Turbo). Supplementary Figure S2 describes the distribution of missing confidence scores, with 30 models having missing confidence scores. Supplementary Figure S8 illustrates a stratified analysis of response accuracy by confidence score missingness for models with missing scores for more than one-third of the questions.

Statistical Analysis

We evaluated each model’s performance from two perspectives: discrimination, the ability to distinguish between correct and incorrect responses, and calibration, the alignment between predicted confidence and actual accuracy.

Discrimination was quantified using AUROC. Specifically, we designated each response as 1 (positive) if it was labeled “correct” and 0 (negative) otherwise. The confidence scores of the model ranged from 0 to 10 and served as the continuous predictor variable.
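A minimal Python sketch of this labelling scheme is given here, using scikit-learn's `roc_auc_score`. The helper name `confidence_auroc`, the handling of missing confidence scores (dropped before scoring), and the toy data are assumptions for illustration rather than the authors' code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score


def confidence_auroc(is_correct, confidence):
    """AUROC of 0-10 confidence scores as a predictor of answer correctness.

    is_correct: iterable of 1 (correct) / 0 (incorrect) labels.
    confidence: iterable of self-reported scores; NaN marks a missing score.
    """
    labels = np.asarray(is_correct, dtype=int)
    scores = np.asarray(confidence, dtype=float)

    # Assumption: responses without an extracted confidence score are dropped.
    keep = ~np.isnan(scores)
    labels, scores = labels[keep], scores[keep]

    # AUROC is undefined when only one class is present.
    if np.unique(labels).size < 2:
        return float("nan")
    return float(roc_auc_score(labels, scores))


# Toy example: two correct and two incorrect answers with overlapping scores.
toy_auroc = confidence_auroc([1, 1, 0, 0], [9, 8, 8, 7])
```

Because AUROC depends only on the ranking of the scores, the 0-10 values can be used directly without rescaling to 0-1.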
We employed the roc_auc_score function from sklearn.metrics to calculate the AUROC. In practical terms, AUROC measures how well confidence scores can separate correct from incorrect answers, with 0.5 indicating random performance and 1.0 indicating perfect discrimination. Conceptually, this involves varying the decision threshold over all possible confidence values, thereby classifying the responses as positive or negative at each threshold.

Calibration was evaluated using calibration plots, the Brier score, and ECE. Calibration plots were generated by normalizing predicted confidence scores to a 0–1 scale, binning them into 0.1 intervals, and plotting the mean predicted confidence against the observed accuracy in each bin. Bins containing fewer than three predictions were excluded to ensure the reliability of the results. Bootstrap resampling (n = 1,000 iterations per bin) was used to derive 95% confidence intervals for each calibration point.

The combination of these metrics provided a comprehensive assessment of model uncertainty estimation. The ECE complements the Brier score by directly quantifying the aggregate discrepancy between predicted probabilities and observed outcomes across bins, whereas the Brier score measures the mean squared error between predictions and true labels. As a result, the Brier score reflects both calibration (how closely predicted probabilities match observed frequencies) and refinement (the sharpness of predictions), whereas ECE focuses more directly on calibration quality. Calculating both metrics captures not only how well models are calibrated, but also the overall predictive accuracy of their probability estimates.

Our development and analysis were performed using Python 3.10. LLM answers were generated and extracted using the OpenAI Python library, the Ollama application (v0.4), LM Studio, and LangChain (v0.2 and v0.3).
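The calibration measures described above can be sketched in a few lines of NumPy. This is an illustrative reading of the text, not the authors' implementation: the function names are hypothetical, inputs are assumed to contain no missing scores, the ECE here weights every non-empty 0.1-wide bin by its share of responses (the text does not say whether small bins were excluded from ECE), and the per-bin bootstrap confidence intervals are omitted.

```python
import numpy as np


def normalize(confidence):
    """Map 0-10 self-reported confidence onto the 0-1 probability scale."""
    return np.asarray(confidence, dtype=float) / 10.0


def brier_score(is_correct, prob):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    y = np.asarray(is_correct, dtype=float)
    p = np.asarray(prob, dtype=float)
    return float(np.mean((p - y) ** 2))


def calibration_bins(is_correct, prob, width=0.1, min_count=3):
    """Return (mean confidence, observed accuracy, n) for each retained bin.

    Bins with fewer than `min_count` predictions are skipped, mirroring the
    exclusion rule for calibration plots described in the text.
    """
    y = np.asarray(is_correct, dtype=float)
    p = np.asarray(prob, dtype=float)
    n_bins = int(round(1.0 / width))
    # Small epsilon guards against floating-point edge effects; p == 1.0 is
    # placed in the last bin.
    idx = np.minimum(np.floor(p * n_bins + 1e-9).astype(int), n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = idx == b
        n = int(mask.sum())
        if n >= min_count:
            rows.append((float(p[mask].mean()), float(y[mask].mean()), n))
    return rows


def expected_calibration_error(is_correct, prob, width=0.1):
    """Bin-size-weighted mean absolute gap between confidence and accuracy."""
    rows = calibration_bins(is_correct, prob, width=width, min_count=1)
    total = sum(n for _, _, n in rows)
    return float(sum(n / total * abs(conf - acc) for conf, acc, n in rows))
```

For a model that reports 9/10 on four questions and answers two correctly, `brier_score` gives 0.41 and `expected_calibration_error` gives 0.40, which shows how uniformly high confidence at moderate accuracy inflates both metrics.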
Statistical analyses were conducted using SciPy (v1.13.1) and Scikit-learn (v1.5.1), with data manipulation and visualization implemented through Pandas (v2.2.2), Matplotlib (v3.9.2), and Seaborn (v0.13.2). Additional methodological details and code are available in our repository (see Code availability).

Declarations

Conflict of Interest Declaration

NN: none; SAASN: none; TS: none; MK: none; PL: none; ZA: none; GN: is a founder of Renalytix, Pensieve, and Verici and provides consultancy services to AstraZeneca, Reata, Renalytix, and Pensieve. He also has equity in Renalytix, Pensieve, and Verici; AS: is on the advisory board and has equity in Virgo Surgical Video Solutions.

Ethical considerations

This study did not require ethical approval, as it did not involve human subjects or human data. We ensured data protection by confirming that the utilized LLM services did not retain or use our queries for model training purposes.

Data availability

The data supporting this study's findings were obtained from the American College of Gastroenterology (ACG) under license agreement. While these data are not publicly available owing to licensing restrictions, they may be obtained from the authors with the ACG’s permission upon reasonable request. ACG self-assessment questions and answers are accessible to members through https://education.gi.org/. The datasets of model confidence scores and correctness used in the current study are available in the narimannr2x/confidence_scoring repository, https://github.com/narimannr2x/confidence_scoring.

Code availability

The underlying code for this study is available at https://github.com/narimannr2x/confidence_scoring.

Acknowledgements

This study was supported by the American Gastroenterological Association AGA-Amgen Fellowship-to-Faculty Transition Award (AGA2023-32-06) for AS. The funding source had no role in the study design, data collection, analysis, interpretation, or manuscript preparation.
We thank the American College of Gastroenterology for providing their question bank, the Hugging Face team for their accessible AI infrastructure, and the theBloke account on Hugging Face for providing quantized versions of open-source LLMs. ChatGPT was used to assist with English language editing during manuscript preparation. The authors reviewed and edited all AI-assisted content and maintained full responsibility for the manuscript's content.

Author contributions

NN: Conceptualization, Methodology, Formal Analysis, Investigation, Data Curation, Writing Original Draft, Programming; SAASN: Methodology, Investigation, Validation, Review & Editing, Project Administration; AS: Methodology, Investigation, Supervision, Validation; TS: Investigation, Validation; GN: Supervision; ZA: Investigation; PL: Investigation; MK: Investigation.

Competing Interests

The authors declare no competing financial interests or personal relationships that could have influenced the work reported in this study. NN: none; SAASN: none; TS: none; MK: none; PL: none; ZA: none; GN: is a founder of Renalytix, Pensieve, and Verici and provides consultancy services to AstraZeneca, Reata, Renalytix, and Pensieve. He also has equity in Renalytix, Pensieve, and Verici; AS: is on the advisory board and has equity in Virgo Surgical Solutions.

References

1. Liévin, V., Hother, C. E., Motzfeldt, A. G. & Winther, O. Can large language models reason about medical questions? Patterns 5, (2024).
2. Singhal, K. et al. Toward expert-level medical question answering with large language models. Nat. Med. 1–8 (2025) doi:10.1038/s41591-024-03423-7.
3. Haltaufderheide, J. & Ranisch, R. The ethics of ChatGPT in medicine and healthcare: a systematic review on Large Language Models (LLMs). Npj Digit. Med. 7, 1–11 (2024).
4. McKenna, N. et al. Sources of Hallucination by Large Language Models on Inference Tasks. Preprint at https://doi.org/10.48550/arXiv.2305.14552 (2023).
5. Xiong, M. et al. Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. Preprint at https://doi.org/10.48550/arXiv.2306.13063 (2024).
6. Duan, J. et al. Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2307.01379 (2024).
7. Wu, J., Yu, Y. & Zhou, H.-Y. Uncertainty Estimation of Large Language Models in Medical Question Answering. Preprint at https://doi.org/10.48550/arXiv.2407.08662 (2024).
8. Tian, K. et al. Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (eds. Bouamor, H., Pino, J. & Bali, K.) 5433–5442 (Association for Computational Linguistics, Singapore, 2023). doi:10.18653/v1/2023.emnlp-main.330.
9. Yang, D., Tsai, Y.-H. H. & Yamada, M. On Verbalized Confidence Scores for LLMs. Preprint at https://doi.org/10.48550/arXiv.2412.14737 (2024).
10. Ni, S., Bi, K., Yu, L. & Guo, J. Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence? Preprint at https://doi.org/10.48550/arXiv.2408.09773 (2024).
11. Omar, M., Agbareia, R., Glicksberg, B. S., Nadkarni, G. N. & Klang, E. Benchmarking the Confidence of Large Language Models in Clinical Questions. 2024.08.11.24311810 Preprint at https://doi.org/10.1101/2024.08.11.24311810 (2024).
12. Savage, T. et al. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. J. Am. Med. Inform. Assoc. JAMIA 32, 139–149 (2025).
13. Griot, M., Hemptinne, C., Vanderdonckt, J. & Yuksel, D. Large Language Models lack essential metacognition for reliable medical reasoning. Nat. Commun. 16, 642 (2025).
14. Safavi-Naini, S. A. A. et al. Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models. Preprint at https://doi.org/10.48550/arXiv.2409.00084 (2024).
15. Omar, M., Agbareia, R., Glicksberg, B. S., Nadkarni, G. N. & Klang, E. Benchmarking the Confidence of Large Language Models in Clinical Questions. 2024.08.11.24311810 Preprint at https://doi.org/10.1101/2024.08.11.24311810 (2024).
16. Vashurin, R. et al. Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph. Preprint at https://doi.org/10.48550/arXiv.2406.15627 (2024).
17. Yang, D., Tsai, Y.-H. H. & Yamada, M. On Verbalized Confidence Scores for LLMs. Preprint at https://doi.org/10.48550/arXiv.2412.14737 (2024).

Table

Table 1. LLM accuracy, discrimination, calibration, and confidence scores, sorted from best calibration (lowest Brier score) to worst within each model family. Brier score and ECE measure calibration; AUROC measures discrimination.

| Model family | Model name and parameter (quantization) | Date accessed | Brier score | ECE | AUROC | Accuracy (%) | Self-reported confidence score, mean (95% CI) |
|---|---|---|---|---|---|---|---|
| Llama | Llama-3.3-70b | December 2024 | 0.260 | 0.199 | 0.563 | 65.66 | 8.46 (8.36-8.56) |
| Llama | Llama 3.1 405B | August 2024 | 0.273 | 0.211 | 0.592 | 64 | 8.47 (8.38-8.57) |
| Llama | Llama3.2-90B | December 2024 | 0.302 | 0.269 | 0.600 | 60.00 | 8.49 (8.34-8.62) |
| Llama | Llama 3.1 70B | August 2024 | 0.313 | 0.283 | 0.538 | 58.19 | 8.51 (8.39-8.62) |
| Llama | Llama 3 70B | May 2024 | 0.334 | 0.301 | 0.572 | 54.66 | 8.38 (8.28-8.48) |
| Llama | Llama 3 8B | May 2024 | 0.422 | 0.450 | 0.478 | 43.33 | 8.54 (8.41-8.68) |
| Llama | Llama-3.2-11b | December 2024 | 0.400 | 0.390 | 0.519 | 48.65 | 8.59 (8.46-8.69) |
| Llama | Llama 3.1 8B | August 2024 | 0.433 | 0.441 | 0.512 | 43.14 | 8.67 (8.54-8.80) |
| Llama | Llama-3.2-3b | December 2024 | 0.465 | 0.487 | 0.534 | 35.66 | 8.32 (8.18-8.45) |
| Llama | Llama 2 70B | April 2024 | 0.481 | 0.493 | 0.529 | 37.71 | 8.70 (8.58-8.81) |
| Llama | Llama-3.2-1b | December 2024 | 0.500 | 0.511 | 0.455 | 30.61 | 8.13 (7.96-8.31) |
| Llama | Llama 2 13B (Q5) | April 2024 | 0.525 | 0.546 | 0.5 | 35.16 | 8.98 (8.92-9.04) |
| Llama | Llama 3 8B (Q8) | April 2024 | 0.527 | 0.613 | 0.472 | 30.35 | 8.65 (8.28-9.02) |
| Llama | Llama 2 7B | April 2024 | 0.528 | 0.587 | 0.47 | 30.87 | 8.66 (8.47-8.84) |
| Llama | Llama 2 13B | April 2024 | 0.531 | 0.558 | 0.52 | 33.11 | 8.89 (8.82-8.95) |
| Llama | Llama 2 7B (Q8) | April 2024 | 0.559 | 0.582 | 0.458 | 32.45 | 9.07 (8.98-9.15) |
| Qwen | Qwen-2.5-72b | September 2024 | 0.326 | 0.304 | 0.549 | 61.48 | 8.39 (8.15-8.63) |
| Qwen | Qwen-2-72B | September 2024 | 0.364 | 0.360 | 0.583 | 57.00 | 9.10 (8.98-9.20) |
| Phi | Phi-3 Medium 14B (Q6) | April 2024 | 0.389 | 0.377 | 0.588 | 48.66 | 8.57 (8.48-8.67) |
| Phi | Phi-3 3B FP16 | April 2024 | 0.458 | 0.464 | 0.486 | 43.79 | 8.96 (8.84-9.07) |
| Phi | Phi-3.5-4b | December 2024 | 0.558 | 0.578 | 0.465 | 31.86 | 8.96 (8.90-9.02) |
| Google | Gemini Advanced Web | March-April 2024 | 0.297 | 0.247 | 0.561 | 58.49 | 8.20 (8.07-8.33) |
| Google | Gemma 2 27B | July 2024 | 0.374 | 0.352 | 0.557 | 50 | 8.52 (8.41-8.63) |
| Google | Gemma 2 9B (Q8) | July 2024 | 0.397 | 0.392 | 0.543 | 45.33 | 8.40 (8.30-8.50) |
| Google | Gemma 2 9B | July 2024 | 0.398 | 0.390 | 0.592 | 44.59 | 8.33 (8.20-8.45) |
| Google | Gemini Web | March 2024 | 0.421 | 0.420 | 0.563 | 44.44 | 8.61 (8.53-8.70) |
| Mistral | Mistral Large | April 2024 | 0.282 | 0.224 | 0.602 | 60.53 | 8.13 (7.98-8.28) |
| Mistral | Mixtral 8x7B | April 2024 | 0.359 | 0.336 | 0.547 | 54.33 | 8.79 (8.72-8.87) |
| Mistral | Mistral v2 Q8 | April 2024 | 0.506 | 0.527 | 0.554 | 39.06 | 9.11 (8.90-9.32) |
| Mistral | Mistral 7B | April 2024 | 0.547 | 0.551 | 0.519 | 40.66 | |
| Claude | Claude 3.5 Sonnet | July 2024 | 0.207 | 0.122 | 0.6 | 74 | 8.60 (8.54-8.67) |
| Claude | Claude 3 Opus | March-April 2024 | 0.229 | 0.150 | 0.575 | 70.35 | 8.54 (8.44-8.63) |
| Claude | Claude 3 Opus Web | March-April 2024 | 0.246 | 0.154 | 0.578 | 65.66 | 7.99 (7.89-8.09) |
| Claude | Claude 3 Sonnet Web | March-April 2024 | 0.326 | 0.284 | 0.551 | 55.33 | 8.37 (8.29-8.45) |
| Claude | Claude 3 Sonnet | March-April 2024 | 0.361 | 0.336 | 0.559 | 51.17 | 8.48 (8.39-8.58) |
| Claude | Claude 3 Haiku | March-April 2024 | 0.373 | 0.357 | 0.522 | 53.76 | 8.88 (8.80-8.96) |
| Claude | Claude 3 Haiku Web | March-April 2024 | 0.398 | 0.385 | 0.523 | 50 | 8.85 (8.80-8.90) |
| GPT | o1 preview | September 2024 | 0.157 | 0.100 | 0.576 | 81.57 | 9.15 (9.10-9.20) |
| GPT | GPT-4o | May 2024 | 0.208 | 0.148 | 0.604 | 74 | 8.86 (8.80-8.92) |
| GPT | GPT-4 Web | March 2024 | 0.267 | 0.221 | 0.588 | 66.22 | 8.79 (8.70-8.87) |
| GPT | GPT-4 | March 2024 | 0.278 | 0.237 | 0.605 | 66.53 | 9.02 (8.92-9.13) |
| GPT | o1 Mini | September 2024 | 0.278 | 0.257 | 0.626 | 66.33 | 9.20 (9.12-9.27) |
| GPT | GPT-4o Mini | July 2024 | 0.342 | 0.309 | 0.572 | 56.61 | 8.75 (8.67-8.83) |
| GPT | GPT-3.5 Web | March 2024 | 0.394 | 0.375 | 0.546 | 47.66 | 8.56 (8.48-8.63) |

Additional Declarations

Competing interest reported.
Supplementary Files

Supplementary.pdf

Status: Under Review. Version 1 posted.

Editorial decision: Revision requested, 04 Aug, 2025
Reviews received at journal, 03 Aug, 2025
Reviewers agreed at journal, 22 Jul, 2025
Reviews received at journal, 20 Jul, 2025
Reviewers agreed at journal, 03 Jun, 2025
Reviewers agreed at journal, 01 Jun, 2025
Reviewers invited by journal, 01 Jun, 2025
Editor assigned by journal, 27 May, 2025
Submission checks completed at journal, 26 May, 2025
First submitted to journal, 22 May, 2025
Also discoverable on Platform About Our Team In Review Editorial Policies Advisory Board Help Center Resources Author Services Accessibility API Access RSS feed Manage Cookie Preferences © Research Square 2026 | ISSN 2693-5015 (online) Privacy Policy Terms of Service Do Not Sell My Personal Information {"props":{"pageProps":{"initialData":{"identity":"rs-6725427","acceptedTermsAndConditions":true,"allowDirectSubmit":false,"archivedVersions":[],"articleType":"Short Report","associatedPublications":[],"authors":[{"id":464938006,"identity":"086a367a-94ad-4c14-970d-4c81b3e1a0b7","order_by":0,"name":"Nariman Naderi","email":"","orcid":"","institution":"Icahn School of Medicine at Mount Sinai","correspondingAuthor":false,"prefix":"","firstName":"Nariman","middleName":"","lastName":"Naderi","suffix":""},{"id":464938007,"identity":"a8464c68-1cd8-4f32-8f8d-4178d5728128","order_by":1,"name":"Seyed Amir Ahmad Safavi-Naini","email":"","orcid":"","institution":"Icahn School of Medicine at Mount Sinai","correspondingAuthor":false,"prefix":"","firstName":"Seyed","middleName":"Amir Ahmad","lastName":"Safavi-Naini","suffix":""},{"id":464938008,"identity":"36e4e187-132a-413d-b475-af0015f0f311","order_by":2,"name":"Thomas Savage","email":"","orcid":"","institution":"University of Pennsylvania","correspondingAuthor":false,"prefix":"","firstName":"Thomas","middleName":"","lastName":"Savage","suffix":""},{"id":464938009,"identity":"9d70371c-67d7-4b55-b2ae-2dd8bb89bfa4","order_by":3,"name":"Mohammad Amin Khalafi","email":"","orcid":"","institution":"Shahid Beheshti University of Medical Sciences","correspondingAuthor":false,"prefix":"","firstName":"Mohammad","middleName":"Amin","lastName":"Khalafi","suffix":""},{"id":464938010,"identity":"e5dd2e21-d5ad-4aee-ab51-da46339d88ed","order_by":4,"name":"Peter Lewis","email":"","orcid":"","institution":"Ontario Tech 
University","correspondingAuthor":false,"prefix":"","firstName":"Peter","middleName":"","lastName":"Lewis","suffix":""},{"id":464938011,"identity":"d28cf80d-30b8-47af-ab4a-2c3689e70231","order_by":5,"name":"Zahra Atf","email":"","orcid":"","institution":"Ontario Tech University","correspondingAuthor":false,"prefix":"","firstName":"Zahra","middleName":"","lastName":"Atf","suffix":""},{"id":464938012,"identity":"ea7529cf-d0f3-4283-b147-15b73068e650","order_by":6,"name":"Girish Nadkarni","email":"","orcid":"","institution":"Icahn School of Medicine at Mount Sinai","correspondingAuthor":false,"prefix":"","firstName":"Girish","middleName":"","lastName":"Nadkarni","suffix":""},{"id":464938013,"identity":"ae91e8be-8822-4aa9-8a76-9da9eb5ff02a","order_by":7,"name":"Ali Soroush","email":"","orcid":"","institution":"Icahn School of Medicine at Mount Sinai","correspondingAuthor":true,"prefix":"","firstName":"Ali","middleName":"","lastName":"Soroush","suffix":""}],"badges":[],"createdAt":"2025-05-22 13:23:23","currentVersionCode":1,"declarations":"","doi":"10.21203/rs.3.rs-6725427/v1","doiUrl":"https://doi.org/10.21203/rs.3.rs-6725427/v1","draftVersion":[],"editorialEvents":[],"editorialNote":"","failedWorkflow":false,"files":[{"id":83890122,"identity":"6f46e645-04c7-4d57-979c-3f65689a726f","added_by":"auto","created_at":"2025-06-04 07:52:07","extension":"png","order_by":1,"title":"Figure 
1","display":"","copyAsset":false,"role":"figure","size":203333,"visible":true,"origin":"","legend":"\u003cp\u003eSummary illustration of pipeline for confidence score extraction from raw textual responses.\u003c/p\u003e","description":"","filename":"1.png","url":"https://assets-eu.researchsquare.com/files/rs-6725427/v1/08f5ffb60534d681aea5cd4d.png"},{"id":83890125,"identity":"ed656a87-4dce-417c-a4e3-b81cb3fb1988","added_by":"auto","created_at":"2025-06-04 07:52:07","extension":"png","order_by":2,"title":"Figure 2","display":"","copyAsset":false,"role":"figure","size":163239,"visible":true,"origin":"","legend":"\u003cp\u003eAverage accuracy versus average confidence scores for LLMs with more than 150 valid samples. The dashed red line indicates perfect calibration, that is, the alignment of the average accuracy and average confidence score. Models above this line are overconfident, whereas those below are under-confident. A subset of the data was magnified for clarity purposes.\u003c/p\u003e","description":"","filename":"2.png","url":"https://assets-eu.researchsquare.com/files/rs-6725427/v1/1a95f8b879216ff64dde957a.png"},{"id":83890124,"identity":"16e147fc-c8b1-4579-a613-63bdeabe8ca4","added_by":"auto","created_at":"2025-06-04 07:52:07","extension":"png","order_by":3,"title":"Figure 3","display":"","copyAsset":false,"role":"figure","size":186371,"visible":true,"origin":"","legend":"\u003cp\u003eLeft panel: Overall distribution of self-reported confidence scores and mean response accuracy (stars) for each model. 
Right panel: Distribution of self-reported confidence scores for each model stratified by response accuracy.\u003c/p\u003e","description":"","filename":"3.png","url":"https://assets-eu.researchsquare.com/files/rs-6725427/v1/e46deda50142c6ff55d34977.png"},{"id":83890127,"identity":"e7887f63-26d6-4548-bbb0-b23070240959","added_by":"auto","created_at":"2025-06-04 07:52:07","extension":"png","order_by":4,"title":"Figure 4","display":"","copyAsset":false,"role":"figure","size":178225,"visible":true,"origin":"","legend":"\u003cp\u003e\u003cstrong\u003eCalibration curves for the top six (a-f) and bottom three (g-i) models.\u003c/strong\u003e Confidence scores were binned into intervals of 0.1 across the range of 0 to 1, with the mean normalized confidence score for each bin plotted against the corresponding observed accuracy. The dashed line represents perfect calibration.\u003c/p\u003e","description":"","filename":"4.png","url":"https://assets-eu.researchsquare.com/files/rs-6725427/v1/4a84a2603c02df94cf549044.png"},{"id":83891077,"identity":"411e6a0a-02a5-400d-b8ac-2f629ecc0366","added_by":"auto","created_at":"2025-06-04 08:00:07","extension":"png","order_by":5,"title":"Figure 5","display":"","copyAsset":false,"role":"figure","size":114166,"visible":true,"origin":"","legend":"\u003cp\u003eSubfigures (a) to (f) illustrate the smoothed trends in accuracy and confidence scores of LLMs as a function of question difficulty, defined as the percentage of test-takers answering correctly. The questions were grouped into bins at 5% intervals to facilitate visualization. Across all models, confidence scores remained relatively stable despite increasing question difficulty and lower model response accuracy. 
Figures (a)–(c) highlight the three models with the lowest Brier scores (highest calibration), whereas Figures (d)–(f) display the three models with the highest Brier scores (lowest calibration).\u003c/p\u003e","description":"","filename":"5.png","url":"https://assets-eu.researchsquare.com/files/rs-6725427/v1/30ccd793233598d2a6a5c875.png"},{"id":83891493,"identity":"81c7a563-e905-4bbb-a3d5-66a308ebf360","added_by":"auto","created_at":"2025-06-04 08:08:12","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"manuscript-pdf","size":1655287,"visible":true,"origin":"","legend":"","description":"","filename":"manuscript.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6725427/v1/0f8b73da-0b9f-4dbf-b5d0-180756150b43.pdf"},{"id":83891078,"identity":"587a5788-a6a8-4c34-be00-53a3622269b9","added_by":"auto","created_at":"2025-06-04 08:00:07","extension":"pdf","order_by":0,"title":"","display":"","copyAsset":false,"role":"supplement","size":1886345,"visible":true,"origin":"","legend":"","description":"","filename":"Supplementary.pdf","url":"https://assets-eu.researchsquare.com/files/rs-6725427/v1/fffacd9068aee1c6033a20cb.pdf"}],"financialInterests":"Competing interest reported. The authors declare no competing financial interests or personal relationships that could have influenced the work reported in this study. NN: none; SAASN: none; TS: none; MK: none; PL: none; ZA: none; GN: is a founder of Renalytix, Pensieve, and Verici and provides consultancy services to AstraZeneca, Reata, Renalytix, and Pensieve. 
He also has equity in Renalytix, Pensieve, and Verici.; AS: is on the advisory board and has equity in Virgo Surgical Solutions;","formattedTitle":"Across Generations, Sizes, and Types, Large Language Models Poorly Report Self-Confidence in Gastroenterology Clinical Reasoning Tasks","fulltext":[{"header":"Introduction","content":"\u003cp\u003eLarge Language Models (LLMs) are rapidly transforming healthcare, but their tendency to present incorrect information with convincing terminology, termed \u0026ldquo;hallucinations\u0026rdquo;, poses substantial safety concerns when used in clinical settings \u003csup\u003e1,2\u003c/sup\u003e. In high-stakes medical environments, such misinformation can lead to misdiagnosis, inappropriate treatment selection, or failure to identify critical patient conditions \u003csup\u003e3,4\u003c/sup\u003e. Communicating uncertainty is therefore essential for providing clinicians with reliable indicators of when model outputs should be treated with caution.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eVarious approaches have been developed to quantify LLM uncertainty, ranging from intrinsic methods analyzing model internals (token probabilities, attention patterns) to extrinsic techniques (ensemble disagreement, calibration layers, surrogate models) \u003csup\u003e5\u0026ndash;7\u003c/sup\u003e. However, many of these methods require substantial computational resources or specialized expertise to implement and interpret.\u0026nbsp;Among these uncertainty quantification techniques, self-reported confidence through natural language offers unique practical advantages for clinical implementation. Unlike complex technical approaches that require specialized expertise to interpret, self-reported confidence provides immediately interpretable outputs in the same natural language format as the provided clinical information \u003csup\u003e8\u0026ndash;11\u003c/sup\u003e. 
This approach requires minimal technical overhead and is intuitive to understand, making it particularly appealing for real-world healthcare applications where simplicity and interpretability are paramount. However, the effectiveness of LLM self-reported confidence depends critically on their language-based metacognition, or the ability to accurately monitor and evaluate one\u0026apos;s own knowledge boundaries and reasoning processes.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003ePrior research suggests that LLM self-reported confidence assessments are poorly calibrated, demonstrating a lack of meaningful metacognitive capabilities \u003csup\u003e12,13\u003c/sup\u003e. However, prior studies of this confidence-accuracy gap have primarily examined general clinical reasoning tasks and a limited subset of model deployment conditions. \u0026nbsp;Our systematic analysis across model architectures, parameter scales, and training methodologies examines whether confidence miscalibration represents a universal limitation or varies meaningfully across model families. These insights can inform the development of more trustworthy clinical AI systems and identify whether certain models and environments inherently promote better self-reported confidence.\u003c/p\u003e\n\u003cp\u003eWe evaluated self-reported confidence for 48 commercial and open-source LLMs across local, web, and API-based environments using the 2022 American College of Gastroenterology self-assessment examination containing 300 board exam-style multiple-choice questions. Gastroenterology was selected as our test domain primarily because the senior author (AS) is a practicing gastroenterologist, providing direct clinical insight into the impact of LLM confidence miscalibration. 
Subspecialty domains like gastroenterology also present unique challenges for clinical reasoning that make them ideal test environments since they require integration of diverse knowledge sources to formulate diagnoses and involve procedures with significant risks, where diagnostic or treatment errors can lead to serious patient harm. We used standardized board exam-style questions because they offer an objective benchmark for evaluating model performance across a range of clinically relevant scenarios.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe employed a systematic approach where\u0026nbsp;models were instructed to select the correct answer choice to each board exam question and explicitly report their confidence on a 0-10 scale (from least to most confident). Building on our established methodology, we optimized model parameters including prompt instructions, temperature settings, and token limit to maximize response accuracy.\u003csup\u003e14\u003c/sup\u003e A semi-automated extraction pipeline with human verification (99% accuracy,\u003cstrong\u003e\u0026nbsp;Supplementary Figure S1\u003c/strong\u003e) was used to process the responses and confidence scores for subsequent analysis.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWe extracted 13,362 answers and 12,307 confidence scores (\u003cstrong\u003eFigure 1\u003c/strong\u003e). The difference between these counts resulted primarily from non-compliance with prompt instructions (n=846) or from reasoning models that exhausted their token limits because of their internal reasoning dialogues (n=209) (\u003cstrong\u003eSupplementary Figure S2\u003c/strong\u003e). Mean confidence scores ranged from 7.99 (95% CI: 7.89-8.09) for Claude-3-Opus to 9.58 (95% CI: 9.45-9.71) for Mistral-7b, while accuracy varied substantially from 30.3% (Llama3-8b-Q8) to 81.5% (o1 preview) (\u003cstrong\u003eTable 1\u003c/strong\u003e). 
All models demonstrated systematic overconfidence, with average confidence consistently exceeding average accuracy (\u003cstrong\u003eFigure 2\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eWe also observed a substantial overlap in confidence distributions between correct and incorrect responses, indicating limited discriminative capacity (\u003cstrong\u003eFigure 3\u003c/strong\u003e). This means models expressed high certainty regardless of whether their answers were right or wrong\u0026mdash;a critical safety issue in clinical settings. We quantified this observation through discrimination metrics. Even the best-performing model (o1 mini) achieved an Area Under the Receiver Operating Characteristic (AUROC) of only 0.626, well below the 0.7 threshold typically considered meaningful for clinical applications (\u003cstrong\u003eTable 1\u003c/strong\u003e;\u0026nbsp;\u003cstrong\u003eSupplementary Figure S3\u003c/strong\u003e). This pattern was consistent across all model families.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eCalibration measurements reinforced these findings. Only 5 of the 48 models demonstrated better-than-random performance. o1-preview showed the best calibration (Brier score: 0.157), followed by Claude-3.5-Sonnet (0.202) and GPT-4o (0.206)\u0026nbsp;\u003cstrong\u003e(Supplementary Figure S4)\u003c/strong\u003e. The calibration curves in\u0026nbsp;\u003cstrong\u003eFigure 4\u003c/strong\u003e and\u0026nbsp;\u003cstrong\u003eSupplementary Figure S5\u003c/strong\u003e visually confirm this systematic overconfidence trend. 
Expected Calibration Error measurements reinforced these findings, with even the best models (o1-preview: 0.100, Claude-3.5-Sonnet: 0.122) showing meaningful deviations from ideal calibration (\u003cstrong\u003eSupplementary Figure S6\u003c/strong\u003e).\u0026nbsp;\u003c/p\u003e\n\u003cp\u003ePerhaps most alarming for clinical applications, we found that models maintained high confidence levels even as their accuracy significantly decreased on more challenging questions (\u003cstrong\u003eFigure 5\u003c/strong\u003e). This relationship was universally observed, as even the best-calibrated models (\u003cstrong\u003eFigure 5a-c\u003c/strong\u003e) showed the same overconfidence on difficult questions as poorly calibrated ones (\u003cstrong\u003eFigure 5d-f\u003c/strong\u003e). We also investigated whether question length affected confidence assessments, finding that confidence scores remained remarkably stable regardless of text complexity and had no meaningful relationship with actual performance \u003cstrong\u003e(Supplementary Figure S7).\u0026nbsp;\u003c/strong\u003eThis suggests models lack the awareness to recognize that longer, potentially more complex questions could reduce their response certainty.\u003c/p\u003e\n\u003cp\u003eLooking at differences between model families, we observed generational improvements in self-assessed confidence performance. Newer versions consistently outperformed their predecessors. For example, o1 showed better calibration than GPT-4o, which in turn outperformed GPT-4 \u003cstrong\u003e(Table 1)\u003c/strong\u003e. Commercial models generally demonstrated superior uncertainty estimation compared to equivalent open-source alternatives, though this pattern had notable exceptions. We also found that quantization, while enabling deployment on less powerful hardware, typically degraded calibration quality (as seen when comparing Llama 3 8B with its quantized counterpart). 
Additional analysis of middle-performing models further confirmed these trends (\u003cstrong\u003eSupplementary Figure S5\u003c/strong\u003e).\u003c/p\u003e\n\u003cp\u003eOur findings confirm previous research highlighting the limitations of LLM self-reported confidence. \u003csup\u003e15,16\u003c/sup\u003e We extend those findings with three additional contributions. First, we present the most comprehensive cross-architectural evaluation to date, testing 48 LLMs\u0026mdash;from 7 B to 175 B parameters\u0026mdash;across commercial, open-source, and quantized deployments. Second, by using gastroenterology board-style questions, we deliver key domain-specific insights. Third, we show quantitatively that all models suffer a common metacognitive deficiency, in which even the best-calibrated LLMs remain systematically overconfident, regardless of question difficulty. This pervasive overconfidence transcends architecture, scale, and deployment environment, pointing to a fundamental limitation of current neural language models.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003eWhile we observed that newer model generations (for example, o1 and Claude 3.5) achieve modestly better calibration metrics, \u003csup\u003e15\u003c/sup\u003e this improved calibration tracks closely with higher overall accuracy (\u003cstrong\u003eTable 1\u003c/strong\u003e). It remains unclear whether these gains reflect genuine uncertainty awareness or simply byproducts of stronger performance \u003csup\u003e17\u003c/sup\u003e. In contrast, across all models, we observed high, unvarying self-reported confidence scores, irrespective of question difficulty (\u003cstrong\u003eFigure 5\u003c/strong\u003e), model generation (\u003cstrong\u003eFigure 2\u003c/strong\u003e), or correctness (\u003cstrong\u003eFigure 3\u003c/strong\u003e). This core observation suggests that confidence outputs are no more than statistically likely text, not true reflections of internal uncertainty. 
In other words, LLMs are reciting the most probable \u0026ldquo;confidence score\u0026rdquo; token, rather than expressing insight into their own knowledge boundaries. Future efforts should explore architectural innovations or training objectives that explicitly foster genuine metacognitive capabilities\u0026mdash;such as self-explanation or introspective feedback loops\u0026mdash;rather than relying on incremental prompt or calibration tweaks.\u003c/p\u003e\n\u003cp\u003eSeveral limitations temper our conclusions. While our use of multiple-choice gastroenterology board exam style questions offers a clear, objective benchmark, this approach may not generalize to open-ended clinical reasoning or to other medical specialties. Our standardized prompt engineering approach, which was designed to maximize accuracy, could itself bias models toward overconfident \u0026ldquo;expert\u0026rdquo; language. Finally, the ACG self-assessment questions are proprietary and only accessible to paying subscribers or via direct request for the structured data. We cannot rule out that LLMs may have seen these questions or similar content during their pretraining.\u003c/p\u003e\n\u003cp\u003eDespite these caveats, our findings highlight critical AI safety gaps: LLMs uniformly overestimate their certainty, poorly discriminate correct from incorrect answers, and fail to adjust confidence for harder questions. In high-stakes clinical settings, reliable expressions of decision certainty are essential for safe human\u0026ndash;AI collaboration. While newer LLM generations offer incremental calibration improvements, none approach the reliability required for autonomous decision support. 
Addressing this confidence-accuracy gap will be vital to protect patient safety and foster appropriate clinician trust as we integrate LLMs into healthcare workflows.\u003c/p\u003e"},{"header":"Methods","content":"\u003cp\u003e \u003cb\u003eReference dataset\u003c/b\u003e \u003c/p\u003e \u003cp\u003eThe 2022 American College of Gastroenterology (ACG) self-assessment consists of 300 questions, of which 138 contain images. These questions were developed by a committee of gastroenterologists to reflect the knowledge, skills, and attitudes required for excellent patient care, covering a broad range of topics, including liver, colon, esophagus, pancreaticobiliary, and endoscopy. The questions were designed to assess higher-order thinking skills and were primarily case based. They were validated through statistical analysis of test-takers' performance, with an average correctness rate of 74.52% \u0026plusmn; 19.49% on the 2022 assessment, indicating a moderate level of difficulty. Only the text portions of the questions and answers were used in this study\u0026rsquo;s analyses. Questions were categorized by length (token count), difficulty (percentage of correct answers by test-takers), and patient care phase (treatment, diagnosis, or investigation). Additional details are provided in the \u003cb\u003eSupplementary Section 1\u003c/b\u003e.\u003c/p\u003e\n\u003ch3\u003eResponse Generation and Confidence Score Elicitation\u003c/h3\u003e\n\u003cp\u003eFor response generation and confidence score elicitation, we built upon our established methodology,\u003csup\u003e\u003cspan citationid=\"CR12\" class=\"CitationRef\"\u003e12\u003c/span\u003e\u003c/sup\u003e using 60 questions from the 2023 self-assessment exam and GPT-3.5 to select the model settings (temperature, maximum input, and output token count), prompt structure, and output format of all models. 
The configuration that maximized response accuracy was a temperature of 1, a maximum token count equal to the input token count\u0026thinsp;+\u0026thinsp;512 output tokens, a structured output approach, and the prompt shown in Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e. Among the various prompt engineering techniques evaluated, the following were identified as having a positive impact on the outcomes: expert mimicry, contextual embedding, Answer and Justify, Chain of Thought, confidence scoring, and direct questioning. OpenAI Web interface, OpenAI API, Claude Web interface, Claude API, Gemini Web interface, Poe Web interface, Firework API, and locally hosted hardware configurations such as RTX4090Ti and H100 systems were used for response generation and confidence score elicitation.\u003c/p\u003e \u003cp\u003e \u003c/p\u003e \u003cdiv id=\"Sec3\" class=\"Section2\"\u003e \u003ch2\u003eOutput Parsing\u003c/h2\u003e \u003cp\u003eTo efficiently extract response and confidence data from the LLM outputs, we developed a structured output pipeline using GPT-4o (Fig.\u0026nbsp;\u003cspan refid=\"Fig1\" class=\"InternalRef\"\u003e1\u003c/span\u003e). Our hybrid methodology combined regex-based rules, which reduced the number of input tokens, with LLM-based extraction, which parsed the key portions of the LLM outputs. Specifically, the pipeline identified sentences containing \"confid\" for further LLM-based parsing to either extract the certainty score (0\u0026ndash;10) or define the score as \"not_mentioned.\" Sentences classified as \"not_mentioned\" in the first pass were passed through the LLM-based parsing step a second time to maximize the extraction performance. The complete output parsing methodology is described in \u003cb\u003eSupplementary Section 1\u003c/b\u003e. 
To validate the output parsing pipeline, we compared it against manually extracted confidence scores from five randomly selected questions per model, achieving 98.8% accuracy (\u003cb\u003eSupplementary Figure \u003cspan refid=\"MOESM1\" class=\"InternalRef\"\u003eS1\u003c/span\u003e\u003c/b\u003e).\u003c/p\u003e \u003cp\u003eBecause some models did not reliably generate confidence scores, we excluded those that were missing confidence scores for more than 50% of the questions (Medicine-Chat Q8, OpenBioLLM-7B Q8, Qwen Qwq-32b, and GPT-3.5 Turbo). \u003cb\u003eSupplementary Figure S2\u003c/b\u003e describes the distribution of missing confidence scores, with 30 models having missing confidence scores. \u003cb\u003eSupplementary Figure S8\u003c/b\u003e illustrates a stratified analysis of response accuracy by confidence score missingness for models with missing scores for more than one-third of the questions.\u003c/p\u003e \u003c/div\u003e \u003cdiv id=\"Sec4\" class=\"Section2\"\u003e \u003ch2\u003eStatistical Analysis\u003c/h2\u003e \u003cp\u003eWe evaluated each model\u0026rsquo;s performance from two perspectives: discrimination, the ability to distinguish between correct and incorrect responses, and calibration, the alignment between predicted confidence and actual accuracy.\u003c/p\u003e \u003cp\u003eDiscrimination was quantified using AUROC. Specifically, we designated each response as 1 (positive) if it was labeled \u0026ldquo;correct\u0026rdquo; and 0 (negative) otherwise. The confidence scores of the model ranged from 0 to 10 and served as the continuous predictor variable. We employed the roc_auc_score function from sklearn.metrics to calculate the AUROC. In practical terms, AUROC measures how well confidence scores can separate correct from incorrect answers, with 0.5 indicating random performance and 1.0 indicating perfect discrimination. 
Conceptually, this involves varying the decision threshold over all possible confidence values, thereby classifying the responses as positive or negative at each threshold.\u003c/p\u003e \u003cp\u003eCalibration was evaluated using calibration plots, the Brier score, and the Expected Calibration Error (ECE). Calibration plots were generated by normalizing predicted confidence scores to a 0\u0026ndash;1 scale, binning them into 0.1 intervals, and plotting the mean predicted confidence against the observed accuracy in each bin. Bins containing fewer than three predictions were excluded to ensure the reliability of the results. Bootstrap resampling (n\u0026thinsp;=\u0026thinsp;1,000 iterations per bin) was used to derive 95% confidence intervals for each calibration point.\u003c/p\u003e \u003cp\u003eThe combination of these metrics provided a comprehensive assessment of model uncertainty estimation. The ECE complements the Brier score by directly quantifying the aggregate discrepancy between predicted probabilities and observed outcomes across bins, whereas the Brier score measures the mean squared error between predictions and true labels. As a result, the Brier score reflects both calibration (how closely predicted probabilities match observed frequencies) and refinement (the sharpness of predictions), whereas ECE focuses more directly on calibration quality. Calculating both metrics provides a more comprehensive evaluation of model performance, capturing not only how well models are calibrated, but also the overall predictive accuracy of their probability estimates.\u003c/p\u003e \u003cp\u003eOur development and analysis were performed using Python 3.10. LLM answers were generated and extracted using the OpenAI Python library, Ollama application (v0.4), LM Studio, and LangChain (v0.2 and v0.3). Statistical analyses were conducted using SciPy (v1.13.1) and Scikit-learn (v1.5.1), with data manipulation and visualization implemented through Pandas (v2.2.2), Matplotlib (v3.9.2), and Seaborn (v0.13.2). 
Additional methodological details and code are available in our repository (see Code availability).\u003c/p\u003e \u003c/div\u003e"},{"header":"Declarations","content":"\u003cp\u003e\u003cstrong\u003eConflict of Interest Declaration\u003c/strong\u003e: NN: none; SAASN: none; TS: none; MK: none; PL: none; ZA: none; GN: is a founder of Renalytix, Pensieve, and Verici and provides consultancy services to AstraZeneca, Reata, Renalytix, and Pensieve. He also has equity in Renalytix, Pensieve, and Verici.; AS: is on the advisory board and has equity in Virgo Surgical Video Solutions;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eEthical considerations\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study did not require ethical approval, as it did not involve human subjects or human data. We ensured data protection by confirming that the utilized LLM services did not retain or use our queries for model training purposes.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eData availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe data supporting this study\u0026apos;s findings were obtained from the American College of Gastroenterology (ACG) under license agreement. While these data are not publicly available owing to licensing restrictions, they may be obtained from the authors with the ACG\u0026rsquo;s permission upon reasonable request. 
ACG self-assessment questions and answers are accessible to members through https://education.gi.org/.\u003c/p\u003e\n\u003cp\u003eThe datasets of model confidence scores and correctness used in the current study are available in \u003cstrong\u003ethe narimannr2x/confidence_scoring\u003c/strong\u003e repository, https://github.com/narimannr2x/confidence_scoring.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCode availability\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe underlying code for this study is available at https://github.com/narimannr2x/confidence_scoring.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAcknowledgements\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThis study was supported by the American Gastroenterological Association AGA-Amgen Fellowship-to-Faculty Transition Award (AGA2023-32-06) for AS. The funding source had no role in the study design, data collection, analysis, interpretation, or manuscript preparation.\u003c/p\u003e\n\u003cp\u003eWe thank the American College of Gastroenterology for providing their question bank, the Hugging Face team for their accessible AI infrastructure, and the theBloke account on Hugging Face for providing quantized versions of open-source LLMs. ChatGPT was used to assist with English language editing during manuscript preparation. 
The authors reviewed and edited all AI-assisted content and maintained full responsibility for the manuscript\u0026apos;s content.\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eAuthor contributions\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eNN:\u0026nbsp;Conceptualization, Methodology, Formal Analysis, Investigation, Data Curation, Writing Original Draft, Programming; SAASN: Methodology, Investigation, Validation, Review \u0026amp; Editing, Project Administration; AS: Methodology, Investigation, Supervision, Validation; TS: Investigation, Validation; GN: Supervision; ZA: Investigation; PL: Investigation; MK: Investigation.\u0026nbsp;\u003c/p\u003e\n\u003cp\u003e\u003cstrong\u003eCompeting Interests\u003c/strong\u003e\u003c/p\u003e\n\u003cp\u003eThe authors declare no competing financial interests or personal relationships that could have influenced the work reported in this study. NN: none; SAASN: none; TS: none; MK: none; PL: none; ZA: none; GN: is a founder of Renalytix, Pensieve, and Verici and provides consultancy services to AstraZeneca, Reata, Renalytix, and Pensieve. He also has equity in Renalytix, Pensieve, and Verici.; AS: is on the advisory board and has equity in Virgo Surgical Solutions;\u003c/p\u003e"},{"header":"References","content":"\u003col\u003e\n \u003cli\u003eLi\u0026eacute;vin, V., Hother, C. E., Motzfeldt, A. G. \u0026amp; Winther, O. Can large language models reason about medical questions? \u003cem\u003ePatterns\u003c/em\u003e \u003cstrong\u003e5\u003c/strong\u003e, (2024).\u003c/li\u003e\n \u003cli\u003eSinghal, K. \u003cem\u003eet al.\u003c/em\u003e Toward expert-level medical question answering with large language models. \u003cem\u003eNat. Med.\u003c/em\u003e 1\u0026ndash;8 (2025) doi:10.1038/s41591-024-03423-7.\u003c/li\u003e\n \u003cli\u003eHaltaufderheide, J. \u0026amp; Ranisch, R. The ethics of ChatGPT in medicine and healthcare: a systematic review on Large Language Models (LLMs). 
\u003cem\u003eNpj Digit. Med.\u003c/em\u003e \u003cstrong\u003e7\u003c/strong\u003e, 1\u0026ndash;11 (2024).\u003c/li\u003e\n \u003cli\u003eMcKenna, N. \u003cem\u003eet al.\u003c/em\u003e Sources of Hallucination by Large Language Models on Inference Tasks. Preprint at https://doi.org/10.48550/arXiv.2305.14552 (2023).\u003c/li\u003e\n \u003cli\u003eXiong, M. \u003cem\u003eet al.\u003c/em\u003e Can LLMs Express Their Uncertainty? An Empirical Evaluation of Confidence Elicitation in LLMs. Preprint at https://doi.org/10.48550/arXiv.2306.13063 (2024).\u003c/li\u003e\n \u003cli\u003eDuan, J. \u003cem\u003eet al.\u003c/em\u003e Shifting Attention to Relevance: Towards the Predictive Uncertainty Quantification of Free-Form Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2307.01379 (2024).\u003c/li\u003e\n \u003cli\u003eWu, J., Yu, Y. \u0026amp; Zhou, H.-Y. Uncertainty Estimation of Large Language Models in Medical Question Answering. Preprint at https://doi.org/10.48550/arXiv.2407.08662 (2024).\u003c/li\u003e\n \u003cli\u003eTian, K. \u003cem\u003eet al.\u003c/em\u003e Just Ask for Calibration: Strategies for Eliciting Calibrated Confidence Scores from Language Models Fine-Tuned with Human Feedback. in \u003cem\u003eProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing\u003c/em\u003e (eds. Bouamor, H., Pino, J. \u0026amp; Bali, K.) 5433\u0026ndash;5442 (Association for Computational Linguistics, Singapore, 2023). doi:10.18653/v1/2023.emnlp-main.330.\u003c/li\u003e\n \u003cli\u003eYang, D., Tsai, Y.-H. H. \u0026amp; Yamada, M. On Verbalized Confidence Scores for LLMs. Preprint at https://doi.org/10.48550/arXiv.2412.14737 (2024).\u003c/li\u003e\n \u003cli\u003eNi, S., Bi, K., Yu, L. \u0026amp; Guo, J. Are Large Language Models More Honest in Their Probabilistic or Verbalized Confidence? Preprint at https://doi.org/10.48550/arXiv.2408.09773 (2024).\u003c/li\u003e\n \u003cli\u003eOmar, M., Agbareia, R., Glicksberg, B. 
S., Nadkarni, G. N. \u0026amp; Klang, E. Benchmarking the Confidence of Large Language Models in Clinical Questions. Preprint at https://doi.org/10.1101/2024.08.11.24311810 (2024).\u003c/li\u003e\n \u003cli\u003eSavage, T. \u003cem\u003eet al.\u003c/em\u003e Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. \u003cem\u003eJ. Am. Med. Inform. Assoc.\u003c/em\u003e \u003cstrong\u003e32\u003c/strong\u003e, 139\u0026ndash;149 (2025).\u003c/li\u003e\n \u003cli\u003eGriot, M., Hemptinne, C., Vanderdonckt, J. \u0026amp; Yuksel, D. Large Language Models lack essential metacognition for reliable medical reasoning. \u003cem\u003eNat. Commun.\u003c/em\u003e \u003cstrong\u003e16\u003c/strong\u003e, 642 (2025).\u003c/li\u003e\n \u003cli\u003eSafavi-Naini, S. A. A. \u003cem\u003eet al.\u003c/em\u003e Vision-Language and Large Language Model Performance in Gastroenterology: GPT, Claude, Llama, Phi, Mistral, Gemma, and Quantized Models. Preprint at https://doi.org/10.48550/arXiv.2409.00084 (2024).\u003c/li\u003e\n \u003cli\u003eOmar, M., Agbareia, R., Glicksberg, B. S., Nadkarni, G. N. \u0026amp; Klang, E. Benchmarking the Confidence of Large Language Models in Clinical Questions. Preprint at https://doi.org/10.1101/2024.08.11.24311810 (2024).\u003c/li\u003e\n \u003cli\u003eVashurin, R. \u003cem\u003eet al.\u003c/em\u003e Benchmarking Uncertainty Quantification Methods for Large Language Models with LM-Polygraph. Preprint at https://doi.org/10.48550/arXiv.2406.15627 (2024).\u003c/li\u003e\n \u003cli\u003eYang, D., Tsai, Y.-H. H. \u0026amp; Yamada, M. On Verbalized Confidence Scores for LLMs. 
Preprint at https://doi.org/10.48550/arXiv.2412.14737 (2024).\u003c/li\u003e\n\u003c/ol\u003e"},{"header":"Table","content":"\u003cp\u003e\u003cstrong\u003eTable\u0026nbsp;1.\u003c/strong\u003e Accuracy, discrimination, calibration, and self-reported confidence scores for each LLM. Within each model family, models are sorted from best calibration (lowest Brier score) to worst. ECE, expected calibration error; AUROC, area under the receiver operating characteristic curve.\u003c/p\u003e\n\u003ctable border=\"1\" cellspacing=\"0\" cellpadding=\"0\" width=\"727\"\u003e\n \u003ctbody\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003eModel family\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eModel name and parameter count (quantization)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eDate accessed\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd colspan=\"2\" valign=\"top\" style=\"width: 126px;\"\u003e\n \u003cp\u003eCalibration\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003eDiscrimination\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eAccuracy\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eSelf-reported confidence score\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eBrier score\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003eECE\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n 
\u003cp\u003eAUROC\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003ePercent\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003eMean (95% CI)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003eLlama\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama-3.3-70b\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eDecember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.260\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.199\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.563\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n 
\u003cp\u003e65.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.46 (8.36-8.56)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 3.1 405B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eAugust 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.273\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.211\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.592\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e64\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.47 (8.38-8.57)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama3.2-90B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eDecember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.302\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.269\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.600\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e60.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.49 
(8.34-8.62)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 3.1 70B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eAugust 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.313\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.283\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.538\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e58.19\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.51 (8.39-8.62)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 3 70B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMay 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.334\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.301\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.572\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e54.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.38 (8.28-8.48)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n 
\u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 3 8B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMay 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.422\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.450\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.478\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e43.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.54 (8.41-8.68)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama-3.2-11b\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eDecember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.400\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.390\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.519\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e48.65\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.59 (8.46-8.69)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 3.1 
8B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eAugust 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.433\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.441\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.512\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e43.14\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.67 (8.54-8.80)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama-3.2-3b\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eDecember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.465\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.487\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.534\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e35.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.32 (8.18-8.45)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 2 70B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.481\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.493\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.529\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e37.71\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.70 (8.58-8.81)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama-3.2-1b\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eDecember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.500\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.511\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.455\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e30.61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.13 (7.96-8.31)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 2 13B (Q5)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.525\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.546\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.5\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e35.16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.98 (8.92-9.04)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 3 8B (Q8)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.527\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.613\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.472\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e30.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.65 (8.28-9.02)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 2 7B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.528\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.587\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" 
style=\"width: 85px;\"\u003e\n \u003cp\u003e0.47\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e30.87\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.66 (8.47-8.84)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 2 13B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.531\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.558\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.52\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e33.11\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.89 (8.82-8.95)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eLlama 2 7B (Q8)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.559\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.582\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.458\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n 
\u003cp\u003e32.45\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e9.07 (8.98-9.15)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003eQwen\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eQwen-2.5-72b\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eSeptember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.326\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.304\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.549\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e61.48\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n 
\u003cp\u003e8.39 (8.15-8.63)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eQwen-2-72B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eSeptember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.364\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.360\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.583\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e57.00\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e9.10 (8.98-9.20)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003ePhi\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" 
style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003ePhi-3 Medium 14B (Q6)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.389\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.377\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.588\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e48.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.57 (8.48-8.67)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003ePhi-3 3B FP16\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.458\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.464\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.486\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e43.79\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.96 (8.84-9.07)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
170px;\"\u003e\n \u003cp\u003ePhi-3.5-4b\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eDecember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.558\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.578\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.465\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e31.86\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.96 (8.90-9.02)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003eGoogle\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGemini Advanced Web\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
76px;\"\u003e\n \u003cp\u003eMarch-April 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.297\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.247\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.561\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e58.49\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.20 (8.07-8.33)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGemma 2 27B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eJuly 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.374\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.352\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.557\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.52 (8.41-8.63)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGemma 2 9B (Q8)\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eJuly 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n 
\u003cp\u003e0.397\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.392\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.543\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e45.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.40 (8.30-8.50)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGemma 2 9B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eJuly 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.398\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.390\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.592\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e44.59\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.33 (8.20-8.45)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGemini Web\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.421\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.420\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.563\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e44.44\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.61 (8.53-8.70)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003eMistral\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eMistral Large\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.282\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.224\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.602\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e60.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.13 (7.98-8.28)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eMixtral 8x7B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.359\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.336\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.547\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e54.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.79 (8.72-8.87)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eMistral v2 Q8\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.506\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.527\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.554\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e39.06\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd 
valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e9.11 (8.90-9.32)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eMistral 7B\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eApril 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.547\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.551\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.519\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e40.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003eClaude\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n 
\u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eClaude 3.5 Sonnet\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eJuly 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.207\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.122\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.6\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.60 (8.54-8.67)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eClaude 3 Opus\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch-April 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.229\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.150\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.575\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e70.35\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.54 (8.44-8.63)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eClaude 3 Opus Web\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch-April 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.246\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.154\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.578\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e65.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e7.99 (7.89-8.09)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eClaude 3 Sonnet Web\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch-April 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.326\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.284\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.551\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e55.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.37 (8.29-8.45)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eClaude 3 Sonnet\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch-April 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.361\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.336\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.559\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e51.17\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.48 (8.39-8.58)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eClaude 3 Haiku\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch-April 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.373\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.357\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.522\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e53.76\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.88 (8.80-8.96)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eClaude 3 Haiku Web\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch-April 2024\u003c/p\u003e\n 
\u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.398\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.385\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.523\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e50\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.85 (8.80-8.90)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003eGPT\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eo1 preview\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eSeptember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.157\u003c/p\u003e\n \u003c/td\u003e\n 
\u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.100\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.576\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e81.57\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e9.15 (9.10-9.20)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGPT-4o\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMay 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.208\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.148\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.604\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e74\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.86 (8.80-8.92)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGPT-4 Web\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.267\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.221\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 
85px;\"\u003e\n \u003cp\u003e0.588\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e66.22\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.79 (8.70-8.87)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGPT-4\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.278\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.237\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.605\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e66.53\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e9.02 (8.92-9.13)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eo1 Mini\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eSeptember 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.278\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.257\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.626\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n 
\u003cp\u003e66.33\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e9.20 (9.12-9.27)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGPT-4o Mini\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eJuly 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.342\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.309\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.572\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e56.61\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.75 (8.67-8.83)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003ctr\u003e\n \u003ctd valign=\"top\" style=\"width: 72px;\"\u003e\n \u003cp\u003e\u0026nbsp;\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 170px;\"\u003e\n \u003cp\u003eGPT-3.5 Web\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003eMarch 2024\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e0.394\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 50px;\"\u003e\n \u003cp\u003e0.375\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 85px;\"\u003e\n \u003cp\u003e0.546\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 76px;\"\u003e\n \u003cp\u003e47.66\u003c/p\u003e\n \u003c/td\u003e\n \u003ctd valign=\"top\" style=\"width: 123px;\"\u003e\n \u003cp\u003e8.56 
(8.48-8.63)\u003c/p\u003e\n \u003c/td\u003e\n \u003c/tr\u003e\n \u003c/tbody\u003e\n\u003c/table\u003e"}],"fulltextSource":"","fullText":"","funders":[],"hasAdminPriorityOnWorkflow":false,"hasManuscriptDocX":true,"hasOptedInToPreprint":true,"hasPassedJournalQc":"","hasAnyPriority":false,"hideJournal":false,"highlight":"","institution":"","isAcceptedByJournal":true,"isAuthorSuppliedPdf":false,"isDeskRejected":"","isHiddenFromSearch":false,"isInQc":false,"isInWorkflow":false,"isPdf":false,"isPdfUpToDate":true,"isWithdrawnOrRetracted":false,"journal":{"display":true,"email":"[email protected]","identity":"npj-gut-and-liver","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [npj Gut and Liver](https://www.nature.com/npjgutliver)","snPcode":"44355","submissionUrl":"https://submission.springernature.com/new-submission/44355/3","title":"npj Gut and Liver","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true},"keywords":"Large Language Models, Metacognition, Artificial Intelligence, Gastroenterology, Uncertainty Quantification","lastPublishedDoi":"10.21203/rs.3.rs-6725427/v1","lastPublishedDoiUrl":"https://doi.org/10.21203/rs.3.rs-6725427/v1","license":{"name":"CC BY 4.0","url":"https://creativecommons.org/licenses/by/4.0/"},"manuscriptAbstract":"\u003cp\u003eThis study evaluated confidence calibration across 48 large language models (LLM) using 300 gastroenterology board exam style questions. Regardless of response accuracy, all models demonstrated poor certainty estimation. Even the best-calibrated systems (o1 preview, GPT-4o, Claude-3.5-Sonnet) showed substantial overconfidence (Brier scores 0.15-0.2, AUROC ~0.6). Most concerning, models maintained high certainty regardless of question difficulty or their actual knowledge limitations. 
This metacognitive deficiency poses significant challenges for safe clinical implementation of current LLMs in gastroenterology.\u003c/p\u003e","manuscriptTitle":"Across Generations, Sizes, and Types, Large Language Models Poorly Report Self-Confidence in Gastroenterology Clinical Reasoning Tasks","msid":"","msnumber":"","nonDraftVersions":[{"code":1,"date":"2025-06-04 07:52:03","doi":"10.21203/rs.3.rs-6725427/v1","editorialEvents":[{"type":"communityComments","content":0},{"type":"decision","content":"Revision requested","date":"2025-08-05T03:49:33+00:00","index":"","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-08-04T00:16:14+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"3877785737663261460673112243368738004","date":"2025-07-22T11:55:43+00:00","index":"hide","fulltext":""},{"type":"editorInvitedReview","content":"","date":"2025-07-21T00:15:05+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"327118947976793433658466273504622692390","date":"2025-06-04T02:11:01+00:00","index":"hide","fulltext":""},{"type":"reviewerAgreed","content":"239642672283230007398883290220957834897","date":"2025-06-02T01:53:45+00:00","index":"hide","fulltext":""},{"type":"reviewersInvited","content":"","date":"2025-06-02T01:46:10+00:00","index":"","fulltext":""},{"type":"editorAssigned","content":"","date":"2025-05-27T07:04:01+00:00","index":"","fulltext":""},{"type":"checksComplete","content":"","date":"2025-05-26T18:29:10+00:00","index":"","fulltext":""},{"type":"submitted","content":"npj Gut and Liver","date":"2025-05-22T13:17:43+00:00","index":"","fulltext":""}],"status":"published","journal":{"display":true,"email":"[email protected]","identity":"npj-gut-and-liver","isNatureJournal":false,"hasQc":true,"allowDirectSubmit":false,"externalIdentity":"","sideBox":"Learn more about [npj Gut and 
Liver](https://www.nature.com/npjgutliver)","snPcode":"44355","submissionUrl":"https://submission.springernature.com/new-submission/44355/3","title":"npj Gut and Liver","twitterHandle":"","acdcEnabled":true,"dfaEnabled":true,"editorialSystem":"stoa","reportingPortfolio":"NPJ","inReviewEnabled":true,"inReviewRevisionsEnabled":true}}],"origin":"","ownerIdentity":"940bd40f-5640-4b55-9003-f64a3b401db7","owner":[],"postedDate":"June 4th, 2025","published":true,"recentEditorialEvents":[],"rejectedJournal":[],"revision":"","amendment":"","status":"under-review","subjectAreas":[{"id":49355798,"name":"Biological sciences/Computational biology and bioinformatics"},{"id":49355799,"name":"Health sciences/Gastroenterology"}],"tags":[],"updatedAt":"2026-01-07T19:38:39+00:00","versionOfRecord":[],"versionCreatedAt":"2025-06-04 07:52:03","video":"","vorDoi":"","vorDoiUrl":"","workflowStages":[]},"version":"v1","identity":"rs-6725427","journalConfig":"researchsquare"},"__N_SSP":true},"page":"/article/[identity]/[[...version]]","query":{"redirect":"/article/rs-6725427","identity":"rs-6725427","version":["v1"]},"buildId":"XKTyCvWXoU3ODBz1xrDgd","isFallback":false,"isExperimentalCompile":false,"dynamicIds":[84888],"gssp":true,"scriptLoader":[]}

License: CC-BY-4.0