MAX-EVAL-11: A Comprehensive Benchmark for Evaluating Large Language Models on Full-Spectrum ICD-11 Medical Coding

doi:10.1101/2025.10.30.25339130

2025 · doi:10.1101/2025.10.30.25339130

preprint OA: closed

Abstract

MAX-EVAL-11 is constructed by converting MIMIC-III discharge summaries from ICD-9 to ICD-11 codes through systematic mapping, creating a synthetic diagnosis dataset of 10,000 clinical notes with comprehensive ICD-11 annotations spanning the complete taxonomy. Unlike existing partial-taxonomy benchmarks that rely on traditional precision-recall metrics, MAX-EVAL-11 introduces a clinically-informed evaluation framework that assigns weighted reward points based on code relevance ranking and diagnostic specificity. This ranking-based scoring system accounts for the varying clinical importance of correctly identifying primary diagnoses versus secondary conditions, better reflecting real-world medical coding accuracy requirements.

Our comprehensive evaluation across state-of-the-art LLMs reveals significant performance variations: Claude 4 Sonnet achieves a weighted score of 0.433 with clinical precision of 43.3%, while Claude 3.7 Sonnet attains 0.396 with 37.2% clinical precision. Gemini Flash demonstrates a weighted score of 0.341 with 31.5% clinical precision. These results reveal substantial performance gaps even in advanced foundation models, underscoring the complexity of comprehensive ICD-11 coding and the need for specialized medical AI systems beyond general-purpose LLMs.

The benchmark provides standardized evaluation through our novel weighted scoring methodology that prioritizes diagnostic accuracy and clinical relevance over simple code-matching metrics. MAX-EVAL-11 addresses critical gaps in medical AI evaluation infrastructure by supporting the transition from legacy ICD-9 systems to ICD-11, facilitating development of clinically validated automated coding solutions that can handle real-world diagnostic complexity at scale.

Competing Interest Statement

The authors have declared no competing interest.

Funding Statement

This study did not receive any funding.

Author Declarations

I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained. Yes

The details of the IRB/oversight body that provided approval or exemption for the research described are given below: MIMIC-3

I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals. Yes

I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance). Yes

I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable. Yes

Footnotes

{nitish.dube{at}maxhealthcare.com, arjun.sharma{at}maxhealthcare.com}, sarthakdeshwal{at}duck.com

Data Availability

All data produced in the present study are available upon reasonable request to the authors.
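
The abstract describes the weighted score only in words: reward points depend on where a code sits in the relevance ranking and on how specific the match is. The sketch below is one possible reading of that idea, not the authors' formula. The 1/rank weights, the 0.5 partial credit for a category-level match, the four-character category prefix, and the function names are all assumptions made for illustration, and the code strings are arbitrary examples.

```python
from typing import Sequence


def rank_weights(n: int) -> list[float]:
    """Assumed weighting: rank r (1-based) gets 1/r, normalized to sum to 1."""
    raw = [1.0 / r for r in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def match_credit(gold: str, predicted: Sequence[str], partial: float = 0.5) -> float:
    """Full credit for an exact code match; assumed partial credit when only
    the first four characters (treated here as the category) agree."""
    if gold in predicted:
        return 1.0
    if any(p[:4] == gold[:4] for p in predicted):
        return partial
    return 0.0


def weighted_score(
    gold_ranked: Sequence[str],
    predicted: Sequence[str],
    partial: float = 0.5,
) -> float:
    """Score in [0, 1]; gold_ranked lists codes from primary to least relevant."""
    if not gold_ranked:
        return 0.0
    weights = rank_weights(len(gold_ranked))
    return sum(
        w * match_credit(g, predicted, partial)
        for w, g in zip(weights, gold_ranked)
    )


# Illustrative code strings only; primary diagnosis listed first.
gold = ["CA40.0", "BA00", "5A11"]
pred = ["CA40.1", "5A11"]
print(round(weighted_score(gold, pred), 3))
```

With three gold codes the normalized weights are 6/11, 3/11 and 2/11, so a category-level hit on the primary code plus an exact hit on the third code gives 3/11 + 2/11 = 5/11. Missing the primary diagnosis costs more than missing a secondary one, which is the behavior the abstract attributes to ranking-based scoring.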

Source provenance

europepmc: last seen: 2026-05-20T01:45:00.602351+00:00