Ensemble Deep Learning Models on Raw DNA Sequences for Viral Genome Identification in Human Samples
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Detecting highly divergent or previously unknown viruses is a critical bottleneck in clinical diagnostics and pathogen surveillance. While alignment-based methods often fail to classify sequences lacking homology to known references, deep learning offers a powerful alternative for signal extraction from ‘viral dark matter.’ In this work, we present a high-performance ensemble of deep convolutional neural networks specifically designed to identify viral contigs in complex human metagenomic datasets. Our framework processes sequences acquired from high-throughput biological sensors and integrates complementary architectures to capture both local motifs and global genomic signatures. The proposed ensemble achieves state-of-the-art performance, reaching an AUROC of 0.939 on 300 bp contigs and significantly outperforming existing models such as transformer-based approaches, ViraMiner, and DeepVirFinder. Crucially, our results demonstrate high robustness to data degradation, maintaining stable predictive power even with a 10% random nucleotide substitution rate, a common challenge in degraded clinical samples. Furthermore, the model generalizes to ‘unseen’ viral families not present during training, demonstrating its utility for emerging threat detection. To ensure full reproducibility and facilitate further research in clinical sensing, the complete code and datasets are publicly available on Github.