Large Language Models for Automated ICD-10 Coding of Obstetric Clinical Notes in Portuguese: Comparison With Human Coders
Abstract
Background: Despite rapid advances in large language models (LLMs), automated ICD-10 coding of real-world clinical narratives remains unreliable. A key challenge lies not only in model limitations, but in the intrinsic ambiguity of clinical documentation and the substantial variability among human coders.

Methods: We benchmarked six general-purpose LLMs for hierarchical ICD-10 coding of 1,117 obstetric discharge summaries written in Brazilian Portuguese. Model performance was evaluated at both category and leaf levels and contextualized against blinded clinician validation to quantify realistic human agreement. We further assessed whether Portuguese-to-English translation or lightweight supervised fine-tuning improved performance.

Results: Even the best-performing model (GPT-4o) achieved only modest agreement with the human reference, with micro-F1 scores of 0.36 at the three-character level and 0.15 at the leaf level. Translation into English did not yield consistent gains, and direct fine-tuning on code–description pairs failed to improve accuracy. In clinician validation, human-coded references achieved a precision of 0.77, compared to 0.59 for the strongest model, revealing a substantial gap between automated predictions and clinically accepted codes.

Conclusions: Our findings indicate that current LLMs remain below human-level reliability for autonomous ICD-10 coding in Portuguese. More importantly, they show that conventional evaluation metrics such as F1-score substantially misrepresent clinical usefulness by conflating model error with human disagreement. The primary bottleneck for automated medical coding is therefore not model capacity, but the imperfect and variable human gold standard itself. LLMs should be positioned as decision-support tools that assist, rather than replace, expert clinical coders.
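The hierarchical evaluation described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' evaluation code: it assumes each note's predicted and reference ICD-10 codes are available as sets, and computes micro-averaged F1 either over full (leaf) codes or over codes truncated to the three-character category level.

```python
def micro_f1(pairs, level=None):
    """Micro-averaged F1 over (predicted, reference) code-set pairs.

    If `level` is an int, codes are truncated to that many characters
    (e.g. 3 for the ICD-10 category level); otherwise full leaf codes
    are compared. `pairs` is an iterable of (set, set) tuples, one per
    discharge summary.
    """
    tp = fp = fn = 0
    for predicted, reference in pairs:
        # Truncate codes to the requested hierarchy level before matching.
        p = {c[:level] if level else c for c in predicted}
        r = {c[:level] if level else c for c in reference}
        tp += len(p & r)   # codes both predicted and in the reference
        fp += len(p - r)   # predicted but absent from the reference
        fn += len(r - p)   # in the reference but missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example (hypothetical codes): the model picks the right category
# but the wrong fourth character, so it scores at the three-character
# level yet fails at the leaf level.
pairs = [({"O80.0"}, {"O80.1"})]
print(micro_f1(pairs, level=3))  # 1.0 at the category level
print(micro_f1(pairs))           # 0.0 at the leaf level
```

This truncation-based scoring is one reason category-level and leaf-level figures (0.36 vs 0.15 for GPT-4o) can diverge so sharply: a prediction that lands in the correct category still counts as a full error at the leaf level.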