EHRs Enable Robust Lung Cancer Risk Stratification with Transformer-based Models: A Retrospective Multi-center Validation Study

Abstract

Early detection of lung cancer is challenging, and current screening eligibility relies on questionnaires that are costly and difficult to scale. We developed and validated risk stratification models using routinely collected, longitudinal, structured Electronic Health Records (EHRs) to support population-level screening and evaluation. In this retrospective, multicentre study, we trained four AI models, comparing non-temporal approaches (Count-Based (CB) Logistic Regression and a time-agnostic Transformer) with temporal sequence modeling approaches (an LSTM network and a time-aware Transformer). External validation was performed on two independent cohorts from Osakidetza (26,348 individuals from Spain) and the University Hospital of Liège (33,576 individuals from Belgium), evaluating external validity and screening efficiency. The time-aware Transformer model (STraTS_t) was the top performer (AUROC 0.809) in the Andalusian Health Service training cohort (202,830 individuals from Spain). Its performance was robustly preserved during sequential external validation (Osakidetza AUROC 0.794; Liège AUROC 0.743). STraTS_t also showed superior screening efficiency, requiring only 26.54% of the population to be screened to detect 70% of lung cancer cases, compared to 41.01% for the baseline CB model. Our findings demonstrate that structured routine EHRs and time-aware Transformers deliver accurate, robust lung cancer risk stratification that is sustained across distinct European health systems. This capability makes a strong case for screening approaches that are cost- and time-efficient and suitable for population-level deployment without requiring new data collection.
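The screening-efficiency figure quoted above (the share of the population that must be screened to detect 70% of cases) can be illustrated with a minimal sketch. The abstract does not describe how the authors computed it, so the ranking-by-risk-score procedure below is an assumption, and the function name `fraction_screened_for_sensitivity` and its inputs are hypothetical.

```python
import numpy as np


def fraction_screened_for_sensitivity(scores, labels, target_sensitivity=0.70):
    """Smallest fraction of the population, ranked by descending risk score,
    that must be screened to capture `target_sensitivity` of the cases.

    Assumed procedure (not taken from the paper): rank everyone by predicted
    risk and screen from the top until enough cases have been found.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    n_cases = int(labels.sum())
    if n_cases == 0:
        raise ValueError("labels contain no positive cases")

    # Highest-risk individuals first; a stable sort keeps ties in input order.
    order = np.argsort(-scores, kind="stable")
    cases_found = np.cumsum(labels[order])

    # Number of cases that must be detected (small tolerance guards against
    # floating-point error in the product before rounding up).
    needed = int(np.ceil(target_sensitivity * n_cases - 1e-9))

    # First rank at which the cumulative number of detected cases is enough.
    n_screened = int(np.searchsorted(cases_found, needed, side="left")) + 1
    return n_screened / len(scores)


# Toy data, invented purely for illustration.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
toy_labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
toy_fraction = fraction_screened_for_sensitivity(toy_scores, toy_labels, 0.70)
```

Under this definition, a lower value means fewer people need screening for the same detection rate. Using the figures reported in the abstract, 26.54% versus 41.01% is a difference of 14.47 percentage points, or roughly a 35% relative reduction in the screened population.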
