Phenotyping using Structured Claims And Linked Electronic health records (PhenoSCALE): A semi-automated pipeline with an example of acute kidney injury
Abstract
Objectives
Probabilistic phenotyping of health events has focused on unstructured or laboratory-based electronic health record (EHR) data, though drug surveillance still relies mostly on claims databases. Using acute kidney injury (AKI) as a case example, we aimed to develop a semi-automated probabilistic phenotyping workflow using claims data.
Materials and methods
We defined highly sensitive “bronze” AKI events using ICD-10 codes and more specific “silver” events through two or more ICD-10 codes within 10 days of the bronze event. Data-driven feature selection identified co-occurring claims, and a least absolute shrinkage and selection operator (LASSO) model predicting the silver labels was used to reduce the feature space and develop the final algorithms. These models were then validated against creatinine-based “gold” events derived from linked EHRs in a 20% held-out testing sample, and their performance was reported as the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC).
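The bronze and silver labeling rules can be illustrated with a minimal sketch. Several details here are assumptions rather than the authors' implementation: the code prefix N17 (acute kidney failure) stands in for the study's AKI code list, the first AKI-coded claim is treated as the bronze event, the bronze claim itself counts toward the "two or more" codes, and the function name label_patient and the toy claims are hypothetical.

```python
from datetime import date, timedelta

# Assumption: N17.x stands in for the study's actual AKI code list.
AKI_PREFIXES = ("N17",)
WINDOW = timedelta(days=10)


def is_aki(code):
    """True if an ICD-10 code falls under one of the assumed AKI prefixes."""
    return code.startswith(AKI_PREFIXES)


def label_patient(claims):
    """Label one patient's claims.

    claims: list of (service_date, icd10_code) tuples.
    Returns (bronze_date, is_silver); bronze_date is None without any AKI code.
    """
    aki_dates = sorted(d for d, code in claims if is_aki(code))
    if not aki_dates:
        return None, False
    # Assumption: the first AKI-coded claim defines the bronze event.
    bronze = aki_dates[0]
    # Assumption: count AKI-coded claims (bronze included) within 10 days.
    n_in_window = sum(1 for d in aki_dates if d - bronze <= WINDOW)
    return bronze, n_in_window >= 2


# Toy claims histories (invented for illustration only).
claims_a = [(date(2020, 1, 1), "N17.9"), (date(2020, 1, 6), "N17.0"),
            (date(2020, 1, 7), "I10")]
claims_b = [(date(2020, 3, 1), "N17.9"), (date(2020, 4, 15), "N17.9")]
claims_c = [(date(2020, 5, 1), "I10")]

for name, claims in [("a", claims_a), ("b", claims_b), ("c", claims_c)]:
    print(name, label_patient(claims))
```

In this toy data, patient a has two AKI codes five days apart and becomes a silver event, patient b has a second code only after 45 days and stays bronze, and patient c has no AKI code at all.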
Results
A total of 2144 features were identified based on co-occurrence with the AKI ICD codes. LASSO identified a set of 36 candidate features as most predictive of the silver labels, of which 7 were manually removed. The final phenotyping model, with 29 claims-based features, had an AUROC of 0.92 in the testing sample, compared with an AUROC of 0.77 for a rule-based approach requiring 2 ICD codes for AKI. The AUPRC of the phenotyping model was 0.52.
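To make the two reported metrics concrete, the sketch below computes AUROC and an AUPRC estimate for a continuous model score and for a binary rule against the same gold labels. It is an illustration only: the labels and scores are invented toy values, not study data, and average precision is used as one common estimator of AUPRC, since the abstract does not state which estimator the authors used.

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random
    negative, with ties counted as one half (Mann-Whitney formulation)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Average precision: precision at each distinct threshold, weighted by
    the gain in recall at that threshold."""
    n_pos = sum(y_true)
    ap, tp, seen, prev_recall = 0.0, 0, 0, 0.0
    for thr in sorted(set(scores), reverse=True):
        batch = [y for y, s in zip(y_true, scores) if s == thr]
        tp += sum(batch)
        seen += len(batch)
        recall = tp / n_pos
        ap += (recall - prev_recall) * (tp / seen)
        prev_recall = recall
    return ap


# Toy data (invented): gold labels, a continuous model score, a binary rule.
gold = [1, 1, 1, 0, 0, 0, 0, 0]
model_score = [0.9, 0.8, 0.3, 0.4, 0.2, 0.1, 0.1, 0.05]
rule_flag = [1, 1, 0, 1, 0, 0, 0, 0]

print("model AUROC:", round(auroc(gold, model_score), 3))
print("rule  AUROC:", round(auroc(gold, rule_flag), 3))
print("model AP   :", round(average_precision(gold, model_score), 3))
```

A binary rule yields only one operating point, so its AUROC reflects a single sensitivity and specificity trade-off, whereas a continuous score is evaluated across all thresholds.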
Discussion
The phenotyping algorithm developed using our semi-automated workflow outperformed the rule-based approach for identifying AKI.
Conclusion
This workflow may hold promise for broader application in phenotyping large-scale health data.
Lay Summary
Understanding whether a patient has experienced a medical event like acute kidney injury (AKI) is not always straightforward when using health insurance claims data. Traditionally, researchers use simple rules, such as the presence of diagnostic codes, to identify such events, but these methods can miss cases or include incorrect ones. In this study, we developed a new semi-automated approach to improve the identification of AKI using only claims data. We first labeled potential AKI events using diagnostic codes and then used a statistical method to select other claims records that frequently occurred around these events. These patterns were used to build a prediction model, which we tested against lab-confirmed cases from linked medical records. Our model correctly identified AKI events with high accuracy and performed significantly better than the traditional rule-based method. This approach could help researchers and public health agencies more accurately identify health events in large datasets, especially when lab results or detailed clinical notes are not available.