MAX-EVAL-11: A Comprehensive Benchmark for Evaluating Large Language Models on Full-Spectrum ICD-11 Medical Coding
Abstract
MAX-EVAL-11 is constructed by converting MIMIC-III discharge summaries from ICD-9 to ICD-11 codes through systematic mapping, yielding a synthetic diagnosis dataset of 10,000 clinical notes with ICD-11 annotations spanning the complete taxonomy. Unlike existing partial-taxonomy benchmarks that rely on traditional precision-recall metrics, MAX-EVAL-11 introduces a clinically informed evaluation framework that assigns weighted reward points based on code relevance ranking and diagnostic specificity. This ranking-based scoring system accounts for the differing clinical importance of correctly identifying primary diagnoses versus secondary conditions, and so better reflects real-world medical coding accuracy requirements. Our evaluation across state-of-the-art LLMs reveals significant performance variation: Claude 4 Sonnet achieves a weighted score of 0.433 with clinical precision of 43.3%, Claude 3.7 Sonnet attains 0.396 with 37.2% clinical precision, and Gemini Flash reaches a weighted score of 0.341 with 31.5% clinical precision. These results show substantial performance gaps even in advanced foundation models, underscoring the complexity of full-taxonomy ICD-11 coding and the need for specialized medical AI systems beyond general-purpose LLMs. The benchmark provides standardized evaluation through a novel weighted scoring methodology that prioritizes diagnostic accuracy and clinical relevance over simple code-matching metrics. MAX-EVAL-11 addresses critical gaps in medical AI evaluation infrastructure by supporting the transition from legacy ICD-9 systems to ICD-11 and by facilitating the development of clinically validated automated coding solutions that can handle real-world diagnostic complexity at scale.
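To make the idea of ranking-based weighted scoring concrete, the following is a minimal Python sketch of how such a metric could work. It is not the benchmark's actual formula, which the abstract does not specify. The reciprocal-rank weights, the 0.5 partial credit for a prediction that matches only the stem category, and the names `rank_weights`, `specificity_credit`, and `weighted_score` are all assumptions introduced for illustration, and the code strings in the usage note are illustrative only.

```python
from typing import List, Sequence


def rank_weights(n: int) -> List[float]:
    """Assumed weighting: the code at 0-based rank i gets 1/(i+1), normalized to sum to 1."""
    raw = [1.0 / (i + 1) for i in range(n)]
    total = sum(raw)
    return [w / total for w in raw]


def specificity_credit(predicted: str, gold: str) -> float:
    """Assumed partial-credit rule: 1.0 for an exact match, 0.5 when only the
    stem (the part before '.') matches, and 0.0 otherwise."""
    if predicted == gold:
        return 1.0
    if predicted.split(".")[0] == gold.split(".")[0]:
        return 0.5
    return 0.0


def weighted_score(gold_ranked: Sequence[str], predicted: Sequence[str]) -> float:
    """Score predictions against gold codes ordered from primary to least relevant.

    Each gold code contributes its rank weight times the best credit that any
    predicted code earns against it, so missing the primary diagnosis costs
    more than missing a secondary condition.
    """
    if not gold_ranked:
        return 0.0
    score = 0.0
    for weight, gold in zip(rank_weights(len(gold_ranked)), gold_ranked):
        best = max((specificity_credit(p, gold) for p in predicted), default=0.0)
        score += weight * best
    return score
```

For example, with gold codes `["CA40.0", "5A11", "BA00"]` ranked from primary to tertiary and predictions `["CA40.1", "5A11"]`, the sketch awards half credit for the primary diagnosis, full credit for the second code, and nothing for the third. A plain precision-recall metric would instead treat the near-miss on the primary diagnosis the same as any other wrong code.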