Assessing Retrieval-Augmented Large Language Model Performance in Emergency Department ICD-10-CM Coding Compared to Human Coders

Eyal Klang
Idit Tessler
Donald U Apakama
Ethan Abbott
Benjamin S Glicksberg
Monique Arnold
Akini Moses
Ankit Sakhuja
Ali Soroush
Alexander W Charney
David L. Reich
Jolion McGreevy
Nicholas Gavin
Brendan Carr
Robert Freeman
Girish N Nadkarni


Abstract

Background

Accurate medical coding is essential for clinical and administrative purposes, but it is complicated, time-consuming, and prone to bias. This study compares large language models (LLMs) enhanced with Retrieval-Augmented Generation (RAG) to provider-assigned codes in producing ICD-10-CM codes from emergency department (ED) clinical records.

Methods

This retrospective cohort study used 500 ED visits randomly selected from the Mount Sinai Health System between January and April 2024. The RAG system integrated data from 1,038,066 past ED visits (2021-2023) into the LLMs’ predictions to improve coding accuracy. Nine commercial and open-source LLMs were evaluated. The primary outcome was a head-to-head comparison of the ICD-10-CM codes generated by the RAG-enhanced LLMs and those assigned by the original providers. A panel of four physicians and two LLMs blindly reviewed the codes, comparing the RAG-enhanced LLM and provider-assigned codes on accuracy and specificity.
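The abstract does not say how past visits were retrieved or how they were passed to the models, so the following is only a minimal sketch of the general retrieval-augmented pattern. The token-overlap (Jaccard) similarity, the three-entry `PAST_VISITS` store, the helper names, and the prompt wording are all illustrative assumptions, not the study's pipeline.

```python
# Hypothetical toy store standing in for the study's corpus of past ED visits.
# Notes and codes are illustrative examples only.
PAST_VISITS = [
    {"note": "chest pain radiating to left arm, diaphoresis", "code": "R07.9"},
    {"note": "dysuria, urinary frequency, suprapubic pain", "code": "N39.0"},
    {"note": "fever, productive cough, shortness of breath", "code": "J18.9"},
]


def tokenize(text):
    """Lowercase a note and split it into a set of word tokens."""
    return set(text.lower().replace(",", " ").split())


def jaccard(a, b):
    """Jaccard similarity between two token sets (assumed similarity measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(query, visits, k=2):
    """Return the k past visits whose notes overlap most with the query note."""
    q = tokenize(query)
    ranked = sorted(
        visits, key=lambda v: jaccard(q, tokenize(v["note"])), reverse=True
    )
    return ranked[:k]


def build_prompt(query, visits, k=2):
    """Assemble an LLM prompt that includes the retrieved visits as context."""
    lines = ["Similar past ED visits and their assigned ICD-10-CM codes:"]
    for v in retrieve(query, visits, k):
        lines.append(f"- Note: {v['note']} | Code: {v['code']}")
    lines.append(f"New visit note: {query}")
    lines.append("Assign the most accurate and specific ICD-10-CM code.")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_prompt("cough and fever with shortness of breath", PAST_VISITS))
```

A production system would more plausibly use dense embeddings and a vector index over the full visit history; the set-overlap score here is chosen only to keep the sketch dependency-free.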

Findings

RAG-enhanced LLMs outperformed provider coders in both the accuracy and specificity of code assignments. In a targeted evaluation of 200 cases where GPT-4 and provider-assigned codes differed, human reviewers favored GPT-4 for accuracy in 447 instances, compared to 277 instances where providers’ codes were preferred (p<0.001). Similarly, GPT-4 was selected for its superior specificity in 509 instances, whereas providers’ codes were preferred in only 181 instances (p<0.001). Smaller open-access models, such as Llama-3.1-70B, also performed well when enhanced with RAG, with 218 instances of accuracy preference compared to 90 for providers’ codes. Furthermore, across all models, the exact match rate between LLM-generated and provider-assigned codes improved significantly following RAG integration, with Qwen-2-7B increasing from 0.8% to 17.6% and Gemma-2-9b-it improving from 7.2% to 26.4%.
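The preference counts above can be sanity-checked with a short calculation. The abstract does not name the statistical test behind the reported p-values, so the exact two-sided sign (binomial) test below is an assumption. It also treats each reviewer rating as independent, which is a simplification because several reviewers rated the same cases. The function names are hypothetical, and the conversion of exact match rates to visit counts assumes all 500 sampled visits were scored.

```python
from math import comb


def preference_share(llm, provider):
    """Fraction of non-tied ratings that favored the LLM-generated code."""
    return llm / (llm + provider)


def sign_test_p(llm, provider):
    """Exact two-sided binomial test against a 50/50 split (assumed test)."""
    n = llm + provider
    k = max(llm, provider)
    # Probability of a split at least this lopsided under the null, doubled.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    return min(1.0, 2 * tail / 2**n)


def matches_from_rate(rate_percent, n_visits=500):
    """Convert an exact match rate in percent to a count of matching visits."""
    return round(rate_percent / 100 * n_visits)


if __name__ == "__main__":
    comparisons = {
        "GPT-4 accuracy": (447, 277),
        "GPT-4 specificity": (509, 181),
        "Llama-3.1-70B accuracy": (218, 90),
    }
    for label, (llm, provider) in comparisons.items():
        share = preference_share(llm, provider)
        p = sign_test_p(llm, provider)
        print(f"{label}: LLM share {share:.3f}, sign-test p = {p:.2e}")

    for model, before, after in [("Qwen-2-7B", 0.8, 17.6), ("Gemma-2-9b-it", 7.2, 26.4)]:
        gain = round(after - before, 1)
        print(model, matches_from_rate(before), "->", matches_from_rate(after),
              f"matches (+{gain} percentage points)")
```

Under these assumptions, GPT-4 was preferred for accuracy in 447 of 724 ratings, or about 62%, and all three comparisons fall below the 0.001 threshold, which is consistent with the reported p-values.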

Interpretation

RAG-enhanced LLMs improved medical coding accuracy in the ED, suggesting applications in clinical workflows. These findings suggest that generative AI could improve clinical outcomes and reduce administrative burdens.

Funding

This work was supported in part through the computational and data resources and staff expertise provided by Scientific Computing and Data at the Icahn School of Medicine at Mount Sinai and supported by the Clinical and Translational Science Awards (CTSA) grant UL1TR004419 from the National Center for Advancing Translational Sciences. Research reported in this publication was also supported by the Office of Research Infrastructure of the National Institutes of Health under award numbers S10OD026880 and S10OD030463. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. The funders played no role in study design, data collection, analysis and interpretation of data, or the writing of this manuscript.

Twitter Summary

A study showed AI models with retrieval-augmented generation outperformed human doctors in ED diagnostic coding accuracy and specificity. Even smaller AI models perform favorably when using RAG. This suggests potential for reducing administrative burden in healthcare, improving coding efficiency, and enhancing clinical documentation.

Version published to 10.1101/2024.10.15.24315526 on medRxiv
Oct 17, 2024

