Longitudinal Masked Representation Learning for Pulmonary Nodule Diagnosis from Language Embedded EHRs
Abstract
Electronic health records (EHRs) are a rich source of clinical data, yet exploiting longitudinal signals for pulmonary nodule diagnosis remains challenging because of the administrative noise and high level of clinical abstraction in these records. Because of this complexity, classification models are prone to overfitting when labeled data is scarce. This study explores masked representation learning (MRL) as a strategy to improve pulmonary nodule diagnosis by modeling longitudinal EHRs across three modalities: clinical conditions, procedures, and medications. We leverage a web-scale text embedding model to encode EHR event streams into semantically embedded sequences. We then pretrain a bidirectional transformer using MRL conditioned on time encodings on a large cohort of general pulmonary conditions from our home institution. Evaluation on a cohort of diagnosed pulmonary nodules shows a significant improvement in diagnostic performance for a model fine-tuned from MRL pretraining (0.781 AUC, 95% CI: [0.780, 0.782]) compared to a supervised model with the same architecture (0.768 AUC, 95% CI: [0.766, 0.770]) when integrating all three modalities. These findings suggest that language-embedded MRL can facilitate downstream clinical classification, offering potential advancements in the comprehensive analysis of longitudinal EHR modalities.
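To make the pretraining objective concrete, the following is a minimal sketch of one masked representation learning step over a sequence of embedded EHR events. It is illustrative only and is not the authors' implementation. The abstract does not specify the masking ratio, the mask token, the form of the time encoding, or the reconstruction loss, so the zero-vector mask, the sinusoidal time encoding, the mean squared error loss, and all function names here (`time_encoding`, `mask_sequence`, `add_time`, `masked_mse`) are assumptions. Toy vectors stand in for the text-embedding model's output, and the bidirectional transformer itself is omitted.

```python
import math
import random


def time_encoding(t_days, dim):
    """Sinusoidal encoding of an event time in days (an assumed form)."""
    enc = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        angle = t_days * freq
        enc.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return enc


def mask_sequence(embeddings, mask_prob, rng):
    """Replace a random subset of event embeddings with a zero vector.

    Returns the masked inputs and the sorted list of masked positions.
    At least one position is always masked so the loss is defined.
    """
    dim = len(embeddings[0])
    masked_idx = [i for i in range(len(embeddings)) if rng.random() < mask_prob]
    if not masked_idx:
        masked_idx = [rng.randrange(len(embeddings))]
    masked = set(masked_idx)
    inputs = [[0.0] * dim if i in masked else list(e)
              for i, e in enumerate(embeddings)]
    return inputs, masked_idx


def add_time(inputs, times, dim):
    """Condition each (possibly masked) event vector on its time encoding."""
    return [[x + t for x, t in zip(vec, time_encoding(ts, dim))]
            for vec, ts in zip(inputs, times)]


def masked_mse(predictions, targets, masked_idx):
    """Mean squared reconstruction error over masked positions only."""
    total, count = 0.0, 0
    for i in masked_idx:
        for p, y in zip(predictions[i], targets[i]):
            total += (p - y) ** 2
            count += 1
    return total / count


# Toy example: five events with 4-dimensional embeddings (hypothetical values).
rng = random.Random(0)
events = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(5)]
times = [0.0, 3.0, 10.0, 45.0, 90.0]  # days since the first event

masked_inputs, masked_idx = mask_sequence(events, mask_prob=0.3, rng=rng)
model_inputs = add_time(masked_inputs, times, dim=4)
# A transformer encoder would map model_inputs to predictions here; the
# masked inputs are used as a placeholder "prediction" to exercise the loss.
loss = masked_mse(masked_inputs, events, masked_idx)
```

In a full pipeline the encoder would be trained to minimize this loss on the pretraining cohort and then fine-tuned with a classification head on the labeled nodule cohort. Restricting the loss to masked positions makes the model infer each hidden event from its temporal context.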