ExactCN: Predicting Exact Copy Numbers on Whole Exome Sequencing Data

Erfan FarhangKia
Ahmet Arda Ceylan
Mert Gencturk
Mehmet Alper Yilmaz
Furkan Karademir
A. Ercument Cicek


Abstract

Precise quantification of copy number variations (CNVs) is crucial to understanding the effects of gene dosage, disease severity, and therapeutic response. Although whole-exome sequencing (WES) offers a cost-effective solution for CNV detection in a clinical setting, it introduces several biases, including those related to sequence length, GC content, and the use of targeting probes. Consequently, estimating exact copy numbers remains challenging, especially for WES data. Here, we present ExactCN, a deep learning-based method for estimating exact copy numbers per exon from WES data. The architecture integrates convolutional layers that extract local read-depth patterns with transformer encoder blocks that capture genomic context and handle sequencing noise. ExactCN is trained on WES samples from the 1000 Genomes Project, using matching WGS-based calls as semi-ground truth. In benchmarks, ExactCN improves on state-of-the-art integer CNV calling performance, reducing the macro-averaged mean absolute error (MAE) from 0.91 to 0.62 and the macro-averaged root mean squared error (RMSE) from 1.31 to 0.78. It also achieves an overall Pearson correlation of 0.669 and a Spearman correlation of 0.550, improving on the second-best method by 0.641 and 0.482, respectively. Furthermore, a fine-tuned, specialized version of ExactCN for aggregate CNV detection in the clinically important duplicated genes SMN1/2 achieves a macro-averaged F1-score of 0.657 and a mean absolute error of 0.3. These results substantially improve on state-of-the-art performance and demonstrate the model's applicability to both research and clinical genomic analyses. ExactCN is available at https://github.com/ciceklab/ExactCN.
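
The abstract reports macro-averaged MAE and RMSE without spelling out the averaging scheme. The sketch below shows one common reading, in which the error is computed separately for each true copy-number class and the per-class values are then averaged with equal weight, so that rare copy numbers count as much as the abundant diploid class. This reading, the function name `macro_mae_rmse`, and the toy data are assumptions made for illustration; they are not taken from the ExactCN code base.

```python
from collections import defaultdict
from math import sqrt


def macro_mae_rmse(true_cn, pred_cn):
    """Return (macro MAE, macro RMSE), averaged over true copy-number classes.

    Assumption: "macro-averaged" means per-class errors weighted equally.
    """
    if len(true_cn) != len(pred_cn) or not true_cn:
        raise ValueError("inputs must be non-empty and of equal length")

    # Group absolute and squared errors by the true copy number of each exon.
    abs_err = defaultdict(list)
    sq_err = defaultdict(list)
    for t, p in zip(true_cn, pred_cn):
        abs_err[t].append(abs(p - t))
        sq_err[t].append((p - t) ** 2)

    # Per-class MAE and RMSE, then an unweighted mean across classes.
    per_class_mae = [sum(v) / len(v) for v in abs_err.values()]
    per_class_rmse = [sqrt(sum(v) / len(v)) for v in sq_err.values()]
    macro_mae = sum(per_class_mae) / len(per_class_mae)
    macro_rmse = sum(per_class_rmse) / len(per_class_rmse)
    return macro_mae, macro_rmse


# Hypothetical per-exon copy numbers: four diploid exons, one deletion, one duplication.
true_cn = [2, 2, 2, 2, 1, 3]
pred_cn = [2, 2, 2, 3, 1, 5]
macro_mae, macro_rmse = macro_mae_rmse(true_cn, pred_cn)
print(f"macro MAE = {macro_mae:.3f}, macro RMSE = {macro_rmse:.3f}")
```

On this toy input the plain (micro) MAE is 0.5, while the macro MAE is 0.75, because the single badly predicted duplication carries the same weight as the entire diploid class.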

Version published to 10.1101/2025.11.24.690086 on bioRxiv
Nov 26, 2025

Related articles

CHALLENGER: Detecting Copy Number Variants in Challenging Regions Using Whole Genome Sequencing Data

This article has 3 authors:
1. Mehmet Alper Yilmaz
2. Ahmet Arda Ceylan
3. A. Ercument Cicek
This article has no evaluations. Latest version: Nov 26, 2025

Comprehensive benchmarking of somatic single-nucleotide variant and indel detection at ultra-low allele fractions using short- and long-read data

This article has 46 authors:
1. Yoo-Jin Jiny Ha
2. Dominika Maziec
3. Julia Markowski
4. Stephanie J. Georges
5. Nancy L. Parmalee
6. Michele Berselli
7. Tim H.H. Coorens
8. Shihua Dong
9. Stephanie Gardiner
10. Divya Kalra
11. Daofeng Li
12. Benpeng Miao
13. Rajeeva Musunuri
14. Liying Xue
15. Zhi Yu
16. Kimberly Walker
17. Lisa Anderson
18. Natalie Y.T. Au
19. Carrie Cibulskis
20. Harsha Doddapaneni
21. Christopher M. Grochowski
22. Dana M. Jensen
23. Tina Lindsay
24. Kelsey Loy
25. Azeet Narayan
26. Giuseppe Narzisi
27. Jeffrey Ou
28. Meranda M. Pham
29. Alexi M. Runnels
30. Andrew B. Stergachis
31. Lila M. Sutherlin
32. Ting Wang
33. Hu Jin
34. William C. Feng
35. Yuwei Zhang
36. Alexander D. Veit
37. Clara TaeHee Kim
38. Hye-Jung E. Chun
39. SMaHT Network Single Nucleotide Variant (SNV) Working Group
40. Kristin Ardlie
41. Robert S. Fulton
42. Soren Germer
43. Richard Gibbs
44. Gabor T. Marth
45. James T. Bennett
46. Peter J. Park
This article has no evaluations. Latest version: Oct 14, 2025

PRSformer: Disease Prediction from Million-Scale Individual Genotypes

This article has 5 authors:
1. Payam Dibaeinia
2. Chris German
3. Suyash Shringarpure
4. Adam Auton
5. Aly A. Khan
This article has no evaluations. Latest version: Oct 27, 2025
