CanBART: A Generative Foundation Model of Cancer Molecular Alterations for Synthetic Patient Generation and Genomic Profile Completion
Abstract
Despite the rapid expansion of genomic profiling in oncology, real-world datasets remain limited in size and unevenly distributed, particularly for rare cancers and underrepresented patient populations. The number of available sequenced cases is often insufficient for robust machine learning, and even within existing cohorts, sequencing panels differ in gene coverage. As a result, crucial molecular features are frequently missing, making it difficult to compare cohorts, identify actionable alterations, or enroll patients into clinical trials. These data gaps limit the utility of precision oncology and drug discovery.
To address these challenges, we introduce CanBART – a generative foundation model trained on sequencing data from 144,000 patients. CanBART represents somatic alterations as tokenized sequences and learns to reconstruct missing genomic features, classify tumor types, and generate synthetic patient cohorts. The model employs a BART-style masked language architecture, enabling flexible inference, imputation, and biologically informed data augmentation.
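As a rough illustration of what representing alterations as tokenized sequences and reconstructing masked features could look like, the sketch below builds a token sequence for one patient and corrupts it for a denoising objective. The token format (`GENE:KIND`), the tumor-type prefix token, the `<mask>` symbol, the example profile, and the function names are all assumptions made for illustration; the abstract does not specify CanBART's actual vocabulary or masking scheme.

```python
import random

MASK = "<mask>"  # hypothetical mask symbol


def tokenize_profile(tumor_type, alterations):
    """Turn one patient's alterations into a flat token sequence.

    `alterations` is a list of (gene, kind) pairs. The "GENE:KIND" token
    format and the leading tumor-type token are illustrative assumptions.
    """
    tokens = [f"<{tumor_type}>"]
    tokens += [f"{gene}:{kind}" for gene, kind in sorted(alterations)]
    return tokens


def mask_tokens(tokens, mask_rate, rng):
    """Randomly replace tokens with MASK.

    Returns the corrupted sequence and a dict mapping each masked
    position to the original token the model would have to reconstruct.
    """
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            corrupted.append(MASK)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets


# Made-up example profile, not taken from any dataset.
profile = tokenize_profile(
    "LUAD", [("TP53", "MUT"), ("KRAS", "MUT"), ("CDKN2A", "DEL")]
)
corrupted, targets = mask_tokens(profile, mask_rate=0.3, rng=random.Random(0))
```

In this framing, imputing a gene that a sequencing panel did not cover amounts to placing a mask at that gene's position and asking the model to fill it in.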
CanBART was applied to impute missing gene statuses in real-world datasets. As an illustrative use case, in the OncoPanel dataset (DFCI), it successfully predicted over one-third of missing mutations with high confidence. Using sampling strategies adapted from NLP, we generated “plausible patients” – synthetic genomic profiles that extend the training data with biologically coherent examples. These synthetic cohorts were biologically validated and used to train tumor-type classifiers, improving accuracy for two-thirds of cancer types, particularly rare ones represented by only 20-500 samples.
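The abstract does not name the sampling strategies it adapts from NLP. One common choice in text generation is nucleus (top-p) sampling, sketched below over a made-up next-token distribution. The function name, the token names, and the probabilities are hypothetical, and this is not presented as the authors' method.

```python
import random


def top_p_sample(probs, p, rng):
    """Draw one token from the smallest high-probability set whose mass reaches p.

    `probs` maps token -> probability. Tokens outside this "nucleus"
    are never sampled, which trims implausible low-probability tails.
    """
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for token, prob in ranked:
        nucleus.append((token, prob))
        total += prob
        if total >= p:
            break
    # Sample proportionally to probability within the nucleus.
    r = rng.random() * total
    acc = 0.0
    for token, prob in nucleus:
        acc += prob
        if r < acc:
            return token
    return nucleus[-1][0]  # guard against floating-point round-off


# Illustrative distribution over the next alteration token (values invented).
next_token_probs = {
    "KRAS:MUT": 0.5,
    "EGFR:MUT": 0.3,
    "STK11:MUT": 0.15,
    "RET:FUSION": 0.05,
}
rng = random.Random(42)
draws = {top_p_sample(next_token_probs, 0.75, rng) for _ in range(200)}
```

With `p=0.75`, only the two most likely tokens can be drawn; raising `p` admits rarer alterations, trading plausibility for diversity in the generated profiles.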
By unifying fragmented genomic data and enabling the generation of plausible profiles, CanBART provides a scalable tool for precision oncology – with applications in rare cancer research, multi-site data harmonization, and clinical trial optimization. Moreover, it generates realistic “virtual patients” that reproduce real-world co-alteration patterns and can be used to simulate clinical trial eligibility.