PANCDetect: Early Detection of Pancreatic Cancer from Multi-modal EHR data with LLM Embeddings
Abstract
Background Pancreatic cancer (PANC) is often diagnosed at a late stage because it lacks specific early symptoms, resulting in one of the highest mortality rates among cancers. Imaging modalities such as MRI and CT offer high diagnostic accuracy, but their cost makes population-wide application impractical. Electronic health records (EHRs) provide a routine, easily accessible, longitudinal, and scalable data source for risk prediction, particularly for diseases without specific symptoms, such as PANC.

Method We introduce PANCDetect, a multimodal framework that uses large language model (LLM)-derived embeddings of diagnoses, procedures, medications, and laboratory tests and integrates these data modalities through a Transformer-based architecture. We trained the model on MarketScan (≈250M patients) and validated it externally on two additional large real-world EHR datasets: University of Michigan Precision Health (UMPH; n≈6M patients) and OneFlorida+ (n≈26M patients). We then fine-tuned the general model on UMPH EHR data. We evaluated both models using metrics including the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall-gain curve (AUPRG), and assessed the top predictive features with integrated gradients (IG).

Result In the MarketScan cohort, PANCDetect achieved an AUROC of 0.812 and an AUPRG of 0.851 at the 6-month prediction window, and an AUROC of 0.735 and an AUPRG of 0.629 at the 60-month window, significantly outperforming CancerRiskNet. External validation on UMPH and OneFlorida+ demonstrated good generalizability, with 6-month AUROCs of 0.7111 and 0.793, respectively. Fine-tuning on UMPH with laboratory data further improved performance, reaching an AUROC of 0.927 and an AUPRG of 0.979 at 6 months. Even at the 60-month horizon, the fine-tuned PANCDetect model maintained strong performance, with an AUROC of 0.927 and an AUPRG of 0.979.
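Because AUPRG is less familiar than AUROC, the following minimal sketch illustrates how precision and recall are rescaled into "gains" relative to the positive-class prevalence before the area is computed. It is not the authors' evaluation code. The function names (`gain`, `prg_points`, `approx_auprg`) are hypothetical, and the area is a simplified approximation that omits the interpolated crossing point at recall gain 0 used in reference implementations.

```python
import numpy as np

def gain(value, pi):
    """Precision gain or recall gain for positive-class prevalence pi.

    gain = (value - pi) / ((1 - pi) * value), so a value equal to the
    prevalence maps to 0 and a value of 1 maps to 1.
    """
    value = np.asarray(value, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        return (value - pi) / ((1.0 - pi) * value)

def prg_points(y_true, y_score):
    """Return (recall_gain, precision_gain) at each distinct score threshold."""
    y_true = np.asarray(y_true, dtype=int)
    y_score = np.asarray(y_score, dtype=float)
    pi = y_true.mean()

    # Sort by descending score and accumulate true/false positives.
    order = np.argsort(-y_score, kind="stable")
    y_sorted, s_sorted = y_true[order], y_score[order]
    tp = np.cumsum(y_sorted)
    fp = np.cumsum(1 - y_sorted)

    # Keep one point per distinct score (the last index of each tie group).
    last = np.append(s_sorted[1:] != s_sorted[:-1], True)
    tp, fp = tp[last], fp[last]

    # Drop thresholds with no true positives (gains are undefined there).
    keep = tp > 0
    tp, fp = tp[keep], fp[keep]

    precision = tp / (tp + fp)
    recall = tp / y_true.sum()
    return gain(recall, pi), gain(precision, pi)

def approx_auprg(y_true, y_score):
    """Trapezoidal area under the PRG curve for recall gain >= 0 (simplified)."""
    rec_g, prec_g = prg_points(y_true, y_score)
    keep = rec_g >= 0
    rec_g, prec_g = rec_g[keep], prec_g[keep]
    # Points are already ordered by non-decreasing recall gain.
    widths = np.diff(rec_g)
    heights = (prec_g[1:] + prec_g[:-1]) / 2.0
    return float(np.sum(widths * heights))

# Toy labels and scores, invented for illustration only.
labels = [1, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.1]
print(approx_auprg(labels, scores))
```

Unlike the raw precision-recall curve, the gain-scaled curve has a baseline that does not depend on class prevalence, which is one reason it can be easier to compare across cohorts with different case rates.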
Attribution analysis highlighted type 2 diabetes, pancreatic diseases, and personal and family cancer history as the most important risk factors.

Conclusion PANCDetect is a state-of-the-art method that integrates multimodal EHR data with LLM embeddings for accurate, interpretable, and generalizable early prediction of pancreatic cancer. The framework holds promise for precision screening of high-risk patients, with the potential to improve survival outcomes without increasing healthcare costs.
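To make the integrated gradients (IG) attribution step concrete, here is a minimal sketch on a toy logistic model. It is not PANCDetect itself: the weights, the three binary features, and the all-zeros baseline are invented for illustration, and the function names are hypothetical. IG averages the model's gradient along a straight path from a baseline input to the actual input and scales it by the input difference.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(x, w, b):
    """Toy risk model: logistic regression over a feature vector (or a batch)."""
    return sigmoid(x @ w + b)

def integrated_gradients(x, baseline, w, b, steps=200):
    """Approximate IG with a midpoint Riemann sum along the baseline-to-x path."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # shape (steps, d)
    p = model(path, w, b)                                # shape (steps,)
    # Analytic gradient of the sigmoid output w.r.t. the input: p * (1 - p) * w.
    grads = (p * (1.0 - p))[:, None] * w                 # shape (steps, d)
    return (x - baseline) * grads.mean(axis=0)

# Hypothetical weights and a hypothetical patient vector (assumptions, not data).
w = np.array([1.5, -0.5, 0.8])
b = -1.0
x = np.array([1.0, 0.0, 1.0])
baseline = np.zeros(3)

attributions = integrated_gradients(x, baseline, w, b)
print(attributions)
# Completeness check: attributions should sum to model(x) - model(baseline).
print(attributions.sum(), model(x, w, b) - model(baseline, w, b))
```

The completeness property shown in the final line is what makes IG attributions interpretable as a decomposition of the change in predicted risk relative to the baseline. For a deep Transformer model the gradient would come from automatic differentiation rather than the closed form used here.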