MAX-EVAL-11: A Comprehensive Benchmark for Evaluating Large Language Models on Full-Spectrum ICD-11 Medical Coding

Ujjwal Singh
Sarthak Deshwal
Nitish Dube
Arjun Sharma


Abstract

MAX-EVAL-11 is constructed by converting MIMIC-III discharge summaries from ICD-9 to ICD-11 codes through systematic mapping, creating a synthetic diagnosis dataset of 10,000 clinical notes with comprehensive ICD-11 annotations spanning the complete taxonomy. Unlike existing partial-taxonomy benchmarks that rely on traditional precision-recall metrics, MAX-EVAL-11 introduces a clinically informed evaluation framework that assigns weighted reward points based on code relevance ranking and diagnostic specificity. This ranking-based scoring system accounts for the varying clinical importance of correctly identifying primary diagnoses versus secondary conditions, better reflecting real-world medical coding accuracy requirements. Our evaluation across state-of-the-art LLMs reveals significant performance variations: Claude 4 Sonnet achieves a weighted score of 0.433 with clinical precision of 43.3%, while Claude 3.7 Sonnet attains 0.396 with 37.2% clinical precision. Gemini Flash reaches a weighted score of 0.341 with 31.5% clinical precision. These results reveal substantial performance gaps even in advanced foundation models, underscoring the complexity of comprehensive ICD-11 coding and the need for specialized medical AI systems beyond general-purpose LLMs. The benchmark provides standardized evaluation through our novel weighted scoring methodology, which prioritizes diagnostic accuracy and clinical relevance over simple code-matching metrics. MAX-EVAL-11 addresses critical gaps in medical AI evaluation infrastructure by supporting the transition from legacy ICD-9 systems to ICD-11, facilitating development of clinically validated automated coding solutions that can handle real-world diagnostic complexity at scale.
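The abstract does not spell out the scoring formula, so the following is only a minimal sketch of how a rank-weighted score of this kind could work, not the authors' method. Everything in it is an assumption made for illustration: the reciprocal-rank weights, the 0.5 partial credit for a category-level match, the function names, and the placeholder code strings (which are not checked against the ICD-11 classification).

```python
from typing import Sequence


def rank_weights(n: int) -> list[float]:
    """Assumed weighting: rank r (1-based) gets 1/r, normalized to sum to 1."""
    raw = [1.0 / r for r in range(1, n + 1)]
    total = sum(raw)
    return [w / total for w in raw]


def match_credit(gold: str, predicted: Sequence[str], partial: float = 0.5) -> float:
    """Assumed specificity rule: full credit for an exact code match,
    partial credit when only the part before the '.' (the category) matches."""
    if gold in predicted:
        return 1.0
    gold_category = gold.split(".")[0]
    if any(p.split(".")[0] == gold_category for p in predicted):
        return partial
    return 0.0


def weighted_score(
    gold_ranked: Sequence[str],
    predicted: Sequence[str],
    partial: float = 0.5,
) -> float:
    """Sum of rank weight times match credit over the gold codes.

    gold_ranked is ordered by clinical importance, primary diagnosis first.
    """
    if not gold_ranked:
        return 0.0
    weights = rank_weights(len(gold_ranked))
    return sum(
        w * match_credit(g, predicted, partial)
        for w, g in zip(weights, gold_ranked)
    )


if __name__ == "__main__":
    # Placeholder code strings, primary diagnosis first (hypothetical data).
    gold = ["G1.0", "G2", "G3.1"]
    missed_primary = ["G2", "G3.1"]
    missed_secondary = ["G1.0", "G3.1"]
    print(weighted_score(gold, missed_primary))
    print(weighted_score(gold, missed_secondary))
```

Under these assumed weights, missing the primary diagnosis lowers the score more than missing a secondary condition, which is the behavior the abstract describes and which a plain precision-recall metric would not capture.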

Version published to 10.1101/2025.10.30.25339130 on medRxiv
Oct 31, 2025

Related articles

Large language models for automatable real-world performance monitoring of diagnostic decision support systems: a comparison to manual doctor panel review in a prospective clinical study

This article has 20 authors:
1. Fabienne Cotte
2. Marcel Schmude
3. Philipp Bode
4. Oula Suliman
5. Filipa Dias Lourenço
6. Miguel Paiva Pereira
7. Nisha Kini
8. Vera Hartenstein
9. Allesandro Muscoloni
10. Lisa Stroux
11. Victor Hertz
12. Sebastian Köhler
13. Valerio Morelli
14. Henry Hoffmann
15. Peter Engerer
16. Stephen Gilbert
17. Kirsten Gray
18. Tauseef Mehrali
19. Micaela Seemann Monteiro
20. Pedro Flores
This article has no evaluations. Latest version: Sep 21, 2025

Beyond Accuracy: An Efficiency- and Safety-Aware Framework for Evaluating Clinical AI with Large Language Models

This article has 4 authors:
1. Nazar Zaki
2. Amal Akor
3. Salahdein Aburuz
4. Sham ZainAlAbdin
This article has no evaluations. Latest version: Oct 19, 2025

From Clinical Judgment to Large Language Models: Benchmarking Predictive Approaches for Unplanned Hospital Admissions

This article has 2 authors:
1. Bernardo Neves
2. Mário J. Silva
This article has no evaluations. Latest version: Sep 12, 2025
