Data Quality in Clinical Coding: A Critical Analysis and Preliminary Study
Abstract
Clinical coding is a vital yet complex component of healthcare practice. While automated coding systems have advanced significantly, they still rely on imperfect training data, which degrades the quality of their predictions. A key issue contributing to this problem, often overlooked in current research, is the presence of errors and undercoding in widely used clinical coding datasets. In this work, we uncover substantial undercoding and annotation errors in commonly used datasets and present the first empirical study of their impact on the performance of automated clinical coding algorithms. We develop a three-stage pipeline combining a large language model (LLM)-based coding evidence extractor, a multiclass classifier trained on silver-labeled evidence from the MIMIC-IV [14] dataset, and an LLM-based verification step that assesses the validity of each code. This approach reveals that approximately 80% of clinical notes in the MDACE [3] dataset and 86% of notes in CodiEsp [25] are likely undercoded or contain errors. Furthermore, correcting these errors yields a relative improvement of 4% in precision and 7% in recall for the current state-of-the-art clinical coding model, PLM-ICD [9]. These findings make it clear that not only the algorithm but also dataset integrity plays a critical role in automated clinical coding.
CCS CONCEPTS
- Computing methodologies → Natural language processing;
- Applied computing → Health informatics.