ExactCN: Predicting Exact Copy Numbers on Whole Exome Sequencing Data
Abstract
Quantifying the exact copy number of copy number variations (CNVs) is crucial to understanding the effects of gene dosage, disease severity, and therapeutic response. Although whole-exome sequencing (WES) offers a cost-effective solution for CNV detection in a clinical setting, it introduces several biases, including those related to sequence length, GC content, and the use of targeting probes. Consequently, estimating exact copy numbers remains challenging, especially for WES data. Here, we present ExactCN, a deep learning-based method for estimating exact per-exon copy numbers from WES data. The architecture integrates convolutional layers that extract local read-depth patterns with transformer encoder blocks that capture genomic context and handle sequencing noise. ExactCN is trained on WES samples from the 1000 Genomes Project, using matching WGS-based calls as semi-ground truth. In benchmarks, ExactCN improves on state-of-the-art integer CNV calling performance, reducing the macro-averaged mean absolute error (MAE) from 0.91 to 0.62 and the macro-averaged root mean squared error (RMSE) from 1.31 to 0.78. It also achieves an overall Pearson correlation of 0.669 and a Spearman correlation of 0.550, improving on the second-best method by 0.641 and 0.482, respectively. Furthermore, a fine-tuned and specialized version of ExactCN for aggregate CNV detection in the clinically important duplicated genes SMN1/2 achieves a macro-averaged F1-score of 0.657 and a mean absolute error of 0.3. These results substantially improve on state-of-the-art performance and demonstrate the model's applicability to both research and clinical genomic analyses. ExactCN is available at https://github.com/ciceklab/ExactCN.
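The macro-averaged error metrics reported above can be illustrated with a minimal sketch. It assumes that "macro-averaged" means computing the error separately for each true copy-number class and then taking the unweighted mean across classes; the abstract does not spell out the exact definition, so this is an assumption. The function name `macro_mae_rmse` and the toy data are hypothetical and are not taken from the ExactCN code base.

```python
import math
from collections import defaultdict


def macro_mae_rmse(true_cn, pred_cn):
    """Return (macro MAE, macro RMSE) over true copy-number classes.

    Assumption: errors are grouped by the true copy number, a per-class
    MAE and RMSE are computed, and classes are averaged with equal weight.
    """
    if len(true_cn) != len(pred_cn):
        raise ValueError("true_cn and pred_cn must have the same length")
    if not true_cn:
        raise ValueError("inputs must be non-empty")

    # Collect absolute errors per true copy-number class.
    abs_err_by_class = defaultdict(list)
    for t, p in zip(true_cn, pred_cn):
        abs_err_by_class[t].append(abs(p - t))

    per_class_mae = []
    per_class_rmse = []
    for errs in abs_err_by_class.values():
        per_class_mae.append(sum(errs) / len(errs))
        per_class_rmse.append(math.sqrt(sum(e * e for e in errs) / len(errs)))

    n_classes = len(abs_err_by_class)
    return sum(per_class_mae) / n_classes, sum(per_class_rmse) / n_classes


# Toy data (hypothetical): most exons are diploid and predicted correctly,
# while the rare copy-number classes are predicted poorly.
true_cn = [2, 2, 2, 2, 1, 3]
pred_cn = [2, 2, 2, 2, 2, 5]

macro_mae, macro_rmse = macro_mae_rmse(true_cn, pred_cn)
micro_mae = sum(abs(p - t) for t, p in zip(true_cn, pred_cn)) / len(true_cn)
print(macro_mae, macro_rmse, micro_mae)
```

In this toy case the plain (micro) MAE is lower than the macro MAE because the abundant diploid exons dominate the average. Under the assumed definition, macro-averaging weights each copy-number class equally, which is one plausible reason to prefer it when most exons carry two copies.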