CHALLENGER: Detecting Copy Number Variants in Challenging Regions Using Whole Genome Sequencing Data
Abstract
Copy number variation (CNV) detection remains a major challenge in whole-genome sequencing (WGS) data, particularly within repetitive, duplicated, and camouflaged genomic regions where short-read WGS (srWGS) often fails to produce confident alignments. Although long-read WGS (lrWGS) substantially improves structural variant resolution, its high cost limits widespread adoption, especially in clinical settings. To address these limitations, we introduce CHALLENGER, a masked language modeling-based approach for clinical CNV detection using short-read depth signals over coding regions. Although the model takes only short-read data as input, it can make calls that are typically accessible only with long reads, providing a cost-effective way to obtain information characteristic of both technologies. The model is pre-trained on semi-ground truth calls made on srWGS data and then fine-tuned using (i) lrWGS-derived, (ii) human expert-labeled, and (iii) experimentally validated CNV call sets. This enables it to learn technology-specific and labeling strategy-specific variant signatures hidden within srWGS profiles and to operate in challenging genomic regions. We show that our short-read-only approach improves the state-of-the-art CNV detection F1-score by 40.8% and, for the first time, captures 80.3% of the CNVs in challenging genomic regions that can otherwise be detected only with long reads. On the human expert-labeled call set, the F1-score improves by 70.5% for duplications and 24.6% for deletions in challenging genes. We also specialize CHALLENGER on the paralog genes SMN1/2, AMY1/2, and NPY4R, and show that it can improve performance on experimentally validated call sets while making paralog-specific calls in addition to aggregate calls. The CHALLENGER code and model are available at https://github.com/ciceklab/CHALLENGER.