Diagnostic Codes in AI Prediction Models and Label Leakage of Same-Admission Clinical Outcomes
Abstract
Importance
Artificial intelligence (AI) and statistical models designed to predict same-admission outcomes for hospitalized patients, such as inpatient mortality, often rely on International Classification of Diseases (ICD) diagnostic codes, even though these codes are not finalized until after hospital discharge.
Objective
To investigate the extent to which the inclusion of ICD codes as features in predictive models inflates performance metrics via “label leakage” (e.g., including the ICD code for cardiac arrest in an inpatient mortality prediction model) and to assess the prevalence and implications of this practice in the existing literature.
Design
Observational study of the MIMIC-IV deidentified inpatient electronic health record database and a systematic literature review.
Setting
Beth Israel Deaconess Medical Center.
Participants
Patients admitted to the hospital with either an emergency department or intensive care unit (ICU) stay between 2008 and 2019.
Main outcome and measures
Using a standard training-validation-test split procedure, we developed multiple multivariable AI prediction models for inpatient mortality (logistic regression, random forest, and XGBoost) using only patient age, sex, and ICD codes as features. We evaluated these models on the test set using the area under the receiver operating characteristic curve (AUROC) and examined variable importance. Next, we conducted a systematic literature review to determine the percentage of published multivariable prediction models using MIMIC that included ICD codes as features.
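For illustration, the sketch below shows one way a pipeline like the one described above could be set up; it is a minimal example, not the authors' code, and assumes a preprocessed MIMIC-IV extract df with one row per admission and hypothetical column names (age, a 0/1 sex_female flag, per-code icd_* indicator columns, and a died_in_hospital label).

    # Minimal sketch of the modeling setup described above (assumptions noted
    # in the surrounding text; split proportions are illustrative only).
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_auc_score

    # Features: age, sex, and one-hot ICD code indicators only.
    icd_cols = [c for c in df.columns if c.startswith("icd_")]
    X = df[["age", "sex_female"] + icd_cols]
    y = df["died_in_hospital"]

    # Standard training-validation-test split (70/15/15 assumed here).
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=0)
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=0)

    # Fit one of the three model types (logistic regression shown).
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    # Evaluate on the held-out test set with AUROC.
    auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    print(f"Test AUROC: {auroc:.3f}")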
Results
The study cohort consisted of 180,640 patients (mean age 58.7 years, range 18-103; 53.0% female), of whom 8,573 (4.7%) died during the inpatient admission. The multivariable prediction models using ICD codes predicted in-hospital mortality with high performance in the test dataset (AUROCs: 0.97-0.98) across logistic regression, random forest, and XGBoost. The most important ICD codes were ‘Brain death’, ‘Cardiac arrest’, ‘Encounter for palliative care’, and ‘Do not resuscitate status’. The literature review found that 40.2% of studies using MIMIC to predict same-admission outcomes included ICD codes as features, even though both the MIMIC publications and documentation clearly state that ICD codes are derived after discharge.
Conclusions and relevance
Using ICD codes as features in same-admission prediction models is a severe methodological flaw that inflates performance metrics and renders the model incapable of making clinically useful predictions in real time. Our literature review demonstrates that the practice is unfortunately common. Addressing this challenge is essential for advancing trustworthy AI in healthcare.
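One simple safeguard, sketched below under the assumption that each candidate feature can be paired with the timestamp at which its value is finalized, is to drop any feature that is not available before the intended prediction time; ICD codes, which are finalized after discharge, would then be excluded from any same-admission model automatically. The helper function and its arguments are hypothetical and not drawn from the article.

    import pandas as pd

    def drop_leaky_features(X: pd.DataFrame,
                            available_at: dict[str, pd.Timestamp],
                            prediction_time: pd.Timestamp) -> pd.DataFrame:
        """Keep only features whose values are finalized before prediction_time.

        available_at maps each feature name to the timestamp at which its
        value becomes known; features with no recorded timestamp are dropped
        conservatively. ICD codes, assigned after discharge, would carry a
        post-discharge timestamp and be removed for same-admission tasks.
        """
        keep = [c for c in X.columns
                if available_at.get(c, pd.Timestamp.max) < prediction_time]
        return X[keep]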
Key Points
Question
Do International Classification of Diseases (ICD) diagnostic codes, which are finalized only after hospital discharge, artificially inflate the performance of AI healthcare prediction models?
Findings
In a systematic literature review, 40.2% of published models trained on the benchmark MIMIC dataset to predict same-admission outcomes used ICD codes as features, despite both MIMIC papers clearly stating that these codes become available only after discharge. Models trained on ICD codes alone in the MIMIC-IV dataset predicted in-hospital mortality with high accuracy (AUROCs: 0.97-0.98). The most important codes are not available in time for any clinically useful mortality prediction (e.g., ‘Brain death’ and ‘Encounter for palliative care’).
Meaning
ICD codes are frequently used in inpatient AI prediction models for outcomes during the same admission, rendering their output clinically useless. To ensure AI models are both reliable and clinically deployable, greater diligence is needed in identifying and preventing label leakage.