Automatic ICD coding using LLMs: a systematic review


Abstract

Background

Manual assignment of International Classification of Diseases (ICD) codes is error-prone. Transformer-based large language models (LLMs) have been proposed to automate coding, but their accuracy and generalizability remain uncertain.

Methods

We performed a systematic review registered with PROSPERO (CRD42024576236) and reported according to PRISMA guidelines. PubMed, Embase, and Google Scholar were searched through January 2025 for peer-reviewed studies that evaluated an LLM (e.g., BERT, GPT) for ICD coding and reported at least one performance metric. Two reviewers independently screened articles, extracted data, and assessed methodological quality with the Joanna Briggs Institute Critical Appraisal Checklist for Analytical Cross-Sectional Studies. Outcomes included micro-F1, macro-F1, accuracy, precision, recall, and AUC, capturing both overall predictive performance and sensitivity to rare ICD codes.

Results

Of 590 records screened, 35 studies met the inclusion criteria: 24 assessed general-purpose coding across broad clinical text, 10 focused on specific clinical contexts, and 11 addressed multilingual interoperability (some studies spanned more than one theme). Median micro-F1 for frequent codes was 0.79 (range, 0.73–0.94), exceeding that of legacy machine-learning baselines in all comparative studies. Performance for infrequent codes was lower (median macro-F1, 0.42) but improved modestly with data augmentation, contrastive retrieval, or graph-based decoders. Only 1 study used federated learning across institutions, and only 3 conducted external validation. Risk-of-bias assessment rated 18 studies (51%) as at moderate risk, primarily owing to unclear blinding of assessors and selective reporting.

Conclusions

LLM-based systems can reliably assign common ICD codes, frequently matching or surpassing professional coders, but accuracy declines for rare diagnoses and external validation remains scant. Prospective, multicenter trials and transparent reporting of prompts and post-processing rules are required before clinical deployment.