Diagnostic Codes in AI Prediction Models and Label Leakage of Same-Admission Clinical Outcomes
Abstract
Importance
Artificial intelligence (AI) and statistical models designed to predict same-admission outcomes for hospitalized patients, such as inpatient mortality, often rely on International Classification of Diseases (ICD) diagnostic codes, even though these codes are not finalized until after hospital discharge.
Objective
To investigate the extent to which the inclusion of ICD codes as features in predictive models inflates performance metrics via “label leakage” (e.g., including the ICD code for cardiac arrest in an inpatient mortality prediction model) and to assess the prevalence and implications of this practice in the existing literature.
Design
Observational study of the MIMIC-IV deidentified inpatient electronic health record database and a systematic literature review.
Setting
Beth Israel Deaconess Medical Center.
Participants
Patients admitted to the hospital with either an emergency department or intensive care unit (ICU) stay between 2008 and 2019.
Main outcome and measures
Using a standard training-validation-test split procedure, we developed multiple multivariable AI prediction models for inpatient mortality (logistic regression, random forest, and XGBoost) using only patient age, sex, and ICD codes as features. We evaluated these models on the test set using the area under the receiver operating characteristic curve (AUROC) and examined variable importance. Next, we conducted a systematic literature review to determine the percentage of published multivariable prediction models using MIMIC that included ICD codes as features.
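For illustration, the sketch below shows one way a pipeline like the one described above could be set up; it is a minimal example, not the authors' code, and assumes a preprocessed MIMIC-IV extract df with one row per admission and hypothetical column names (age, a 0/1 sex_female flag, per-code icd_* indicator columns, and a died_in_hospital label).

    # Minimal sketch of the modeling setup described above (assumptions noted
    # in the surrounding text; split proportions are illustrative only).
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Features: age, sex, and one-hot ICD code indicators only.
    icd_cols = [c for c in df.columns if c.startswith("icd_")]
    X = df[["age", "sex_female"] + icd_cols]
    y = df["died_in_hospital"]

    # Standard training-validation-test split (70/15/15 assumed here).
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

    # Fit one of the three model types (logistic regression shown).
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Evaluate on the held-out test set with AUROC.
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Test AUROC: {auroc:.3f}")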
Results
The study cohort consisted of 180,640 patients (mean age 58.7 years, range 18-103; 53.0% female), of whom 8,573 (4.7%) died during the inpatient admission. The multivariable prediction models using ICD codes predicted in-hospital mortality with high performance in the test dataset (AUROCs: 0.97-0.98) across logistic regression, random forest, and XGBoost. The most important ICD codes were ‘Brain death’, ‘Cardiac arrest’, ‘Encounter for palliative care’, and ‘Do not resuscitate status’. The literature review found that 40.2% of studies using MIMIC to predict same-admission outcomes included ICD codes as features, even though both the MIMIC publications and documentation clearly state that ICD codes are derived after discharge.
Conclusions and relevance
Using ICD codes as features in same-admission prediction models is a severe methodological flaw that inflates performance metrics and renders the model incapable of making clinically useful predictions in real time. Our literature review demonstrates that the practice is unfortunately common. Addressing this challenge is essential for advancing trustworthy AI in healthcare.
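One simple safeguard, sketched below under the assumption that each candidate feature can be paired with the timestamp at which its value is finalized, is to drop any feature that is not available before the intended prediction time; ICD codes, which are finalized after discharge, would then be excluded from any same-admission model automatically. The helper function and its arguments are hypothetical and not drawn from the article.

    import pandas as pd

    def drop_leaky_features(X: pd.DataFrame,
                            available_at: dict[str, pd.Timestamp],
                            prediction_time: pd.Timestamp) -> pd.DataFrame:
        """Keep only features whose values are finalized before prediction_time.

        available_at maps each feature name to the timestamp at which its
        value becomes known; features with no recorded timestamp are dropped
        conservatively. ICD codes, assigned after discharge, would carry a
        post-discharge timestamp and be removed for same-admission tasks.
        """
        keep = [c for c in X.columns
                if available_at.get(c, pd.Timestamp.max) < prediction_time]
        return X[keep]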
Key Points
Question
Do International Classification of Diseases (ICD) diagnostic codes, which are finalized only after hospital discharge, artificially inflate the performance of AI healthcare prediction models?
Findings
In a systematic literature review, 40.2% of published models trained on the benchmark MIMIC dataset to predict same-admission outcomes used ICD codes as features, despite both MIMIC papers clearly stating that these codes become available only after discharge. Models trained on ICD codes alone in the MIMIC-IV dataset predicted in-hospital mortality with high accuracy (AUROCs: 0.97-0.98). The most important codes are not available in time for any clinically useful mortality prediction (e.g., ‘Brain death’ and ‘Encounter for palliative care’).
Meaning
ICD codes are frequently used in inpatient AI prediction models for outcomes during the same admission, rendering their output clinically useless. To ensure AI models are both reliable and clinically deployable, greater diligence is needed in identifying and preventing label leakage.