Anonymized Somatic Tumor Twins (STTs) enable open genome data sharing and use in research and clinical oncology
Abstract
The study of somatic variants from tumor genomes is fundamental to cancer research and clinical decision-making. However, existing data protection frameworks restrict the use and sharing of these variants in conjunction with sensitive germline information. To overcome these challenges, we developed GenomeAnonymizer, the first method to anonymize short-read DNA sequences from tumor-normal pairs. GenomeAnonymizer generates Somatic Tumor Twins (STTs), anonymized versions of the original data that preserve the donor’s privacy while retaining somatic tumor information and sequencing noise. The method removed all detectable germline variants from the 47 PCAWG-Pilot samples. We further demonstrate that Whole-Genome Sequencing (WGS) STTs preserve more than 98% of the original somatic variants, enabling reliable downstream analyses that replicate somatic-related findings from the original samples, including cancer driver genes, mutational signatures, and intratumor heterogeneity. Importantly, we also show that STTs can reproduce the identification of actionable genes and the downstream clinical interpretations and decision-making. We generated a cancer cohort of STTs matched with synthetic clinical data that could be openly shared and used across projects and centers worldwide. This paradigm-shifting approach will accelerate discovery and clinical translation in oncology and enable the robust benchmarking of genome analysis tools and large-scale data infrastructures.