Data Quality in Clinical Coding: A Critical Analysis and Preliminary Study
Abstract
Clinical coding is a vital yet complex component of healthcare practice. While automated coding systems have advanced significantly, they still rely on imperfect training data, which degrades the quality of their predictions. A key issue contributing to this problem, often overlooked in current research, is the presence of errors and undercoding in widely used clinical coding datasets. In this work, we uncover substantial undercoding and annotation errors in commonly used datasets and present the first empirical study of their impact on the performance of automated clinical coding algorithms. We develop a three-stage pipeline combining a large language model (LLM)-based coding evidence extractor, a multiclass classifier trained on silver-labeled evidence from the MIMIC-IV [14] dataset, and an LLM-based verification step that assesses the validity of each code. This approach reveals that approximately 80% of clinical notes in the MDACE [3] dataset and 86% of notes in CodiEsp [25] are likely undercoded or contain errors. Furthermore, correcting these errors yields a relative improvement of 4% in precision and 7% in recall for the current state-of-the-art clinical coding model, PLM-ICD [9]. These findings make it clear that not only the algorithm but also dataset integrity plays a critical role in automated clinical coding.
CCS CONCEPTS
- Computing methodologies → Natural language processing;
- Applied computing → Health informatics.