Large Language Models for Automated ICD-10 Coding of Obstetric Clinical Notes in Portuguese: Comparison With Human Coders
Abstract
Background: Despite rapid advances in large language models (LLMs), automated ICD-10 coding of real-world clinical narratives remains unreliable. A key challenge lies not only in model limitations, but in the intrinsic ambiguity of clinical documentation and the substantial variability among human coders.

Methods: We benchmarked six general-purpose LLMs for hierarchical ICD-10 coding of 1,117 obstetric discharge summaries written in Brazilian Portuguese. Model performance was evaluated at both category and leaf levels and contextualized against blinded clinician validation to quantify realistic human agreement. We further assessed whether Portuguese-to-English translation or lightweight supervised fine-tuning improved performance.

Results: Even the best-performing model (GPT-4o) achieved only modest agreement with the human reference, with micro-F1 scores of 0.36 at the three-character level and 0.15 at the leaf level. Translation into English did not yield consistent gains, and direct fine-tuning on code–description pairs failed to improve accuracy. In clinician validation, human-coded references achieved a precision of 0.77, compared to 0.59 for the strongest model, revealing a substantial gap between automated predictions and clinically accepted codes.

Conclusions: Our findings indicate that current LLMs remain below human-level reliability for autonomous ICD-10 coding in Portuguese. More importantly, they show that conventional evaluation metrics such as F1-score substantially misrepresent clinical usefulness by conflating model error with human disagreement. The primary bottleneck for automated medical coding is therefore not model capacity, but the imperfect and variable human gold standard itself. LLMs should be positioned as decision-support tools that assist, rather than replace, expert clinical coders.
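The hierarchical evaluation described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' evaluation code: it assumes each note's predicted and reference ICD-10 codes are available as sets, and computes micro-averaged F1 either over full (leaf) codes or over codes truncated to the three-character category level.

```python
def micro_f1(pairs, level=None):
    """Micro-averaged F1 over (predicted, reference) code-set pairs.

    If `level` is an int, codes are truncated to that many characters
    (e.g. 3 for the ICD-10 category level); otherwise full leaf codes
    are compared. `pairs` is an iterable of (set, set) tuples, one per
    discharge summary.
    """
    tp = fp = fn = 0
    for predicted, reference in pairs:
        # Truncate codes to the requested hierarchy level before matching.
        p = {c[:level] if level else c for c in predicted}
        r = {c[:level] if level else c for c in reference}
        tp += len(p & r)   # codes both predicted and in the reference
        fp += len(p - r)   # predicted but absent from the reference
        fn += len(r - p)   # in the reference but missed by the model
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Toy example (hypothetical codes): the model picks the right category
# but the wrong fourth character, so it scores at the three-character
# level yet fails at the leaf level.
pairs = [({"O80.0"}, {"O80.1"})]
print(micro_f1(pairs, level=3))  # 1.0 at the category level
print(micro_f1(pairs))           # 0.0 at the leaf level
```

This truncation-based scoring is one reason category-level and leaf-level figures (0.36 vs 0.15 for GPT-4o) can diverge so sharply: a prediction that lands in the correct category still counts as a full error at the leaf level.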