Phenotyping using Structured Claims And Linked Electronic health records (PhenoSCALE): A semi-automated pipeline with an example of acute kidney injury

Richeek Pradhan
Joyce Lii
Shirley V. Wang
Robert Ball
Rishi J. Desai

Abstract

Objectives

Probabilistic phenotyping of health events has focused largely on unstructured or laboratory-based electronic health record (EHR) data, although drug surveillance still relies mostly on claims databases. Using acute kidney injury (AKI) as a case example, we aimed to develop a semi-automated probabilistic phenotyping workflow using claims data.

Materials and methods

We defined highly sensitive “bronze” AKI events using ICD-10 codes and more specific “silver” events by requiring two or more ICD-10 codes within 10 days of the bronze event. Data-driven feature selection identified co-occurring claims, and a least absolute shrinkage and selection operator (LASSO) model predicting the silver labels was used to reduce the feature space and develop the final algorithms. These models were then validated against creatinine-based “gold” events derived from linked EHRs in a 20% hold-out testing sample, and their performance was reported using the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
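The bronze and silver labeling step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not list the ICD-10 codes used, so the N17 category (acute kidney failure) stands in for them; the claim tuple layout and the names `label_events`, `AKI_PREFIXES`, and `SILVER_WINDOW_DAYS` are hypothetical; and the "within 10 days" rule is read here as two or more AKI-coded claims, counting the bronze claim itself, dated from the bronze date through 10 days after it.

```python
from datetime import date, timedelta

# Assumption: ICD-10 category N17 stands in for the study's AKI code list,
# which the abstract does not enumerate.
AKI_PREFIXES = ("N17",)
SILVER_WINDOW_DAYS = 10


def is_aki(code):
    """True if an ICD-10 code falls under the assumed AKI prefixes."""
    return code.upper().startswith(AKI_PREFIXES)


def label_events(claims):
    """Map patient_id -> (bronze_date, is_silver).

    claims: iterable of (patient_id, service_date, icd10_code) tuples
    (a hypothetical layout chosen for this sketch).
    Bronze: the patient's earliest AKI-coded claim.
    Silver: at least two AKI-coded claims, counting the bronze claim,
    dated from the bronze date through SILVER_WINDOW_DAYS days later
    (one possible reading of the rule in the text).
    """
    aki_dates = {}
    for patient_id, service_date, code in claims:
        if is_aki(code):
            aki_dates.setdefault(patient_id, []).append(service_date)

    labels = {}
    for patient_id, dates in aki_dates.items():
        bronze = min(dates)
        window_end = bronze + timedelta(days=SILVER_WINDOW_DAYS)
        in_window = [d for d in dates if bronze <= d <= window_end]
        labels[patient_id] = (bronze, len(in_window) >= 2)
    return labels


# Toy claims, invented purely for illustration.
claims = [
    ("p1", date(2024, 1, 1), "N17.9"),
    ("p1", date(2024, 1, 4), "N17.0"),   # second AKI code inside the window
    ("p2", date(2024, 2, 1), "N17.9"),
    ("p2", date(2024, 3, 15), "N17.9"),  # outside the 10-day window
    ("p3", date(2024, 1, 1), "I10"),     # no AKI code, so no bronze event
]
labels = label_events(claims)
```

In this toy input, `p1` is labeled silver, `p2` is bronze only, and `p3` receives no label. In the workflow described above, the silver flag would then serve as the outcome for the LASSO model.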

Results

A total of 2144 features were identified based on co-occurrence with the AKI ICD codes. LASSO identified a set of 36 candidate features as most predictive of the silver labels, of which 7 were manually removed. The final phenotyping model, with 29 claims-based features, had an AUROC of 0.92 in the testing sample, compared with an AUROC of 0.77 for a rule-based approach requiring 2 ICD codes for AKI. The AUPRC of the phenotyping model was 0.52.
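To make the two reported metrics concrete, the following sketch computes AUROC and a common AUPRC estimate (average precision) on toy data. The labels and scores are invented for illustration and are unrelated to the study's results; the function names are hypothetical, and average precision is only one of several ways to summarize a precision-recall curve, so it may differ from the estimator the authors used.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, counting ties as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at the rank of each true positive.
    Assumes no tied scores."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits


# Toy gold labels (1 = event) with a continuous model score and a binary rule.
y_gold = [1, 0, 1, 0, 0, 0, 0, 0]
model_scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.1, 0.05, 0.02]
rule_flags = [1, 0, 0, 0, 0, 0, 0, 0]  # rule fires for the first patient only

model_auroc = auroc(y_gold, model_scores)
rule_auroc = auroc(y_gold, rule_flags)
model_ap = average_precision(y_gold, model_scores)
prevalence = sum(y_gold) / len(y_gold)
```

Two points help when reading such numbers. A binary rule has a single operating point, so its AUROC with ties counted as one half equals the mean of its sensitivity and specificity (here 0.5 and 1.0, giving 0.75). The AUPRC of an uninformative classifier is approximately the event prevalence, so an AUPRC is best judged against prevalence in the test sample rather than against 0.5.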

Discussion

The phenotyping algorithm developed using our semi-automated workflow outperformed a rule-based approach for identifying AKI.

Conclusion

This workflow may hold promise for broader application in phenotyping large-scale health data.

Lay Summary

Determining whether a patient has experienced a medical event like acute kidney injury (AKI) is not always straightforward when using health insurance claims data. Traditionally, researchers use simple rules, such as the presence of diagnostic codes, to identify such events, but these methods can miss cases or include incorrect ones. In this study, we developed a new semi-automated approach to improve the identification of AKI using only claims data. We first labeled potential AKI events using diagnostic codes and then used a statistical method to select other claims records that frequently occurred around these events. These patterns were used to build a prediction model, which we tested against lab-confirmed cases from linked medical records. Our model identified AKI events with high accuracy and performed significantly better than the traditional rule-based method. This approach could help researchers and public health agencies more accurately identify health events in large datasets, especially when lab results or detailed clinical notes are not available.

Version published to 10.1101/2025.10.23.25338672 on medRxiv
Oct 24, 2025

