CHALLENGER: Detecting Copy Number Variants in Challenging Regions Using Whole Genome Sequencing Data

Abstract

Copy number variation (CNV) detection remains a major challenge in whole-genome sequencing (WGS) data, particularly within repetitive, duplicated, and camouflaged genomic regions where short-read WGS (srWGS) often fails to produce confident alignments. Although long-read WGS (lrWGS) substantially improves structural variant resolution, its high cost limits widespread adoption, especially in clinical settings. To address these limitations, we introduce CHALLENGER, a masked language modeling-based approach for clinical CNV detection using short-read depth signals over coding regions. Although the model takes only short-read data as input, it can make calls that are typically accessible only with long reads, providing a cost-effective way to obtain information characteristic of both technologies. The model is pre-trained on semi-ground truth calls made on srWGS data and then fine-tuned using (i) lrWGS-derived, (ii) human expert-labeled, and (iii) experimentally validated CNV call sets. This enables it to learn technology- and labeling strategy-specific variant signatures hidden within srWGS profiles and to operate in challenging genomic regions. We show that our short-read-only approach improves the state-of-the-art CNV detection F1-score by 40.8% while, for the first time, capturing 80.3% of CNVs that can only be detected using long reads in challenging genomic regions. On the human expert call set, the improvement in F1-score in challenging genes is 70.5% for duplications and 24.6% for deletions. We also specialize CHALLENGER on the paralog genes SMN1/2, AMY1/2, and NPY4R, and show that it improves performance on experimentally validated call sets while making paralog-specific calls in addition to aggregate calls. The CHALLENGER code and model are available at https://github.com/ciceklab/CHALLENGER.
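
The results above are reported as F1-scores, separately for duplications and deletions. As a reminder of what that metric measures, the following is a minimal sketch of a per-class F1 computation over region-level CNV labels. It is not taken from the CHALLENGER code base: the function name `f1_for_class`, the label strings "DEL", "DUP", and "NONE", and the toy label lists are all hypothetical, and the abstract does not state the unit (exon, gene, or other region) at which calls are scored.

```python
from typing import Sequence


def f1_for_class(truth: Sequence[str], calls: Sequence[str], label: str) -> float:
    """F1-score for one CNV class (hypothetical labels: "DEL", "DUP", "NONE").

    `truth` and `calls` hold one label per region, in the same order.
    """
    if len(truth) != len(calls):
        raise ValueError("truth and calls must have the same length")

    # Count true positives, false positives, and false negatives for `label`.
    tp = sum(1 for t, c in zip(truth, calls) if t == label and c == label)
    fp = sum(1 for t, c in zip(truth, calls) if t != label and c == label)
    fn = sum(1 for t, c in zip(truth, calls) if t == label and c != label)

    if tp == 0:
        # No correct calls of this class: precision or recall is zero or undefined.
        return 0.0

    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels for six regions (illustrative only, not real data).
truth = ["DEL", "DEL", "DUP", "NONE", "NONE", "DUP"]
calls = ["DEL", "NONE", "DUP", "NONE", "DUP", "DUP"]

print(f"deletion F1:    {f1_for_class(truth, calls, 'DEL'):.3f}")
print(f"duplication F1: {f1_for_class(truth, calls, 'DUP'):.3f}")
```

In this toy example one of the two true deletions is missed, which lowers deletion recall, and one spurious duplication is called, which lowers duplication precision. F1 penalizes both kinds of error.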
