Can Large Language Models Replace Coding Specialists? Evaluating GPT Performance in Medical Coding Tasks

Yeli Feng

Abstract

Purpose: Large language models (LLMs), GPT in particular, have demonstrated near human-level performance in the medical domain, in tasks ranging from summarizing clinical notes and passing medical licensing examinations to predictive tasks such as disease diagnosis and treatment recommendation. However, there is currently little research on their efficacy for medical coding, a pivotal component of health informatics, clinical trials, and reimbursement management. This study proposes a prompt framework and investigates its effectiveness in medical coding tasks.

Methods: First, a medical coding prompt framework is proposed. The framework aims to improve performance on complex coding tasks by leveraging state-of-the-art (SOTA) prompt techniques, including meta prompts, multi-shot learning, and dynamic in-context learning, to extract task-specific knowledge. It is implemented with a combination of commercial GPT-4o and open-source LLM. Its effectiveness is then evaluated on three different coding tasks. Finally, ablation studies are presented to validate and analyze the contribution of each module in the framework.

Results: On the MIMIC-IV dataset, the prediction accuracy is 68.1% over the 30 most frequent MS-DRG codes, and the top-5 accuracy is 90.0%. To the best of our knowledge, this accuracy is comparable to the SOTA result of 69.4%, obtained by fine-tuning the open-source LLaMA model. The clinical trial criteria coding task yields a macro F1 score of 68.4 on the Chinese-language CHIP-CTC test dataset, close to the 70.9 achieved by the best supervised training method in the comparison. For the less complex semantic coding task, the method yields a macro F1 score of 79.7 on the Chinese-language CHIP-STS test dataset, which is not competitive with most of the supervised training methods in the comparison.

Conclusion: This study demonstrates that, for complex medical coding tasks, carefully designed prompt-based learning can achieve performance similar to that of SOTA supervised model training approaches. Currently, LLMs can be very helpful assistants, but they do not replace human coding specialists. With the rapid advancement of LLMs, their potential to reliably automate complex medical coding in the near future should not be underestimated.
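
The results in the abstract are reported as top-1 accuracy, top-5 accuracy, and macro F1. The sketch below shows how these metrics are conventionally computed. It is not the authors' evaluation code: the function names and the toy labels "A", "B", and "C" are hypothetical, and the functions return fractions in [0, 1], whereas the abstract reports values on a 0 to 100 scale.

```python
from collections import defaultdict


def top_k_accuracy(ranked_predictions, gold_labels, k=5):
    """Fraction of cases whose gold label is among the first k ranked predictions."""
    hits = sum(
        1
        for ranked, gold in zip(ranked_predictions, gold_labels)
        if gold in ranked[:k]
    )
    return hits / len(gold_labels)


def macro_f1(predicted, gold):
    """Unweighted mean of per-class F1 over every class seen in gold or predicted."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for p, g in zip(predicted, gold):
        if p == g:
            tp[g] += 1
        else:
            fp[p] += 1  # p was predicted but is wrong
            fn[g] += 1  # g was the answer but was missed
    classes = set(gold) | set(predicted)
    scores = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); a class with no counts scores 0.
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example with hypothetical labels (not real MS-DRG or CHIP categories).
ranked = [["A", "B"], ["B", "C"], ["C", "A"]]
gold = ["A", "C", "B"]
print(top_k_accuracy(ranked, gold, k=1))
print(top_k_accuracy(ranked, gold, k=2))
print(macro_f1(["A", "B", "C", "A"], ["A", "C", "C", "A"]))
```

Macro averaging weights every class equally regardless of frequency, so rare categories influence the score as much as common ones.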

Version published to 10.21203/rs.3.rs-5750190/v1 on Research Square
Jan 8, 2025

Related articles

Large Language Model Biases in Healthcare: A Scoping Review and Call for an Integrated Assessment Framework

This article has 8 authors:
1. Lu He
2. D. Phuong Do
3. Vishesh Girish Shet
4. Omar Farghaly
5. Priya Deshpande
6. Praveen Madiraju
7. Jiancheng Ye
8. Molly Beestrum
This article has no evaluations. Latest version: Jan 16, 2026

Prompt-Orchestrated Large Language Models for Clinical Information Extraction

This article has 13 authors:
1. Livia Lilli
2. Andrea Rosati
3. Giovanni Paolo Tobia
4. Massimo Criscione
5. Federica Tomassini
6. Chiara Dachena
7. Alice Luraschi
8. Chiara Cantarini
9. Carolina De Maria
10. Luigi Congedo
11. Massimo Bernaschi
12. Stefano Patarnello
13. Anna Fagotti
This article has no evaluations. Latest version: Jan 16, 2026

Comparative Accuracy of Large Language Models for CPT Coding Assignments from Surgical Procedure Notes

This article has 4 authors:
1. Abdalrahman Katranji
2. Aisa De Vries
3. Abdalmajid Katranji
4. Mohammad Zalzaleh
This article has no evaluations. Latest version: Jan 8, 2026
