Automatic ICD coding using LLMs: a systematic review


Abstract

Background

Manual assignment of International Classification of Diseases (ICD) codes is error-prone. Transformer-based large language models (LLMs) have been proposed to automate coding, but their accuracy and generalizability remain uncertain.

Methods

We performed a systematic review registered with PROSPERO (CRD42024576236) and reported according to PRISMA guidelines. PubMed, Embase, and Google Scholar were searched through January 2025 for peer-reviewed studies that evaluated an LLM (e.g., BERT, GPT) for ICD coding and reported at least one performance metric. Two reviewers independently screened articles, extracted data, and assessed methodological quality with the Joanna Briggs Institute Critical Appraisal Checklist for Analytical Cross-Sectional Studies. Outcomes included micro-F1, macro-F1, accuracy, precision, recall, and AUC, capturing both overall predictive performance and sensitivity to rare ICD codes.

Results

Of 590 records screened, 35 studies met the inclusion criteria: 24 assessed general-purpose coding across broad clinical text, 10 focused on specific clinical contexts, and 11 addressed multilingual interoperability (some studies spanned more than one theme). Median micro-F1 for frequent codes was 0.79 (range, 0.73–0.94), exceeding that of legacy machine-learning baselines in all comparative studies. Performance for infrequent codes was lower (median macro-F1, 0.42) but improved modestly with data augmentation, contrastive retrieval, or graph-based decoders. Only 1 study used federated learning across institutions, and only 3 conducted external validation. Risk-of-bias assessment rated 18 studies (51%) as at moderate risk, primarily owing to unclear blinding of assessors and selective reporting.

Conclusions

LLM-based systems can reliably assign common ICD codes, frequently matching or surpassing professional coders, but accuracy declines for rare diagnoses and external validation remains scant. Prospective, multicenter trials and transparent reporting of prompts and post-processing rules are required before clinical deployment.