Illuminating the Virosphere’s Dark Matter using Hierarchical Deep Learning
Abstract
Systematic discovery of novel viruses is essential for pandemic preparedness, understanding tumor-associated viruses, developing viral delivery systems, and advancing biomedical applications. Yet the majority of sequences in metagenomic datasets lack close relatives in existing references, representing a vast viral “dark matter” whose biology and evolution remain largely unknown. The central task is threefold: (1) to determine whether a genome is viral or non-viral, (2) to assign viral genomes to known lineages when possible, and, critically, (3) to recognize when no existing lineage applies and thereby identify candidates for entirely novel viral groups. Existing approaches, which depend on sequence homology or narrow markers, struggle to capture this uncharted viral space. Here we present DeepVirus, a hierarchical transformer-based framework that models viral genomes as structured sequences of protein-coding genes. By combining protein-level embeddings from a foundation model with genome-aware representations, DeepVirus achieves accurate classification across deep taxonomic hierarchies and extends beyond conventional classification to detect and organize candidate novel viral lineages through open-set recognition. Applied to large-scale metagenomic resources, DeepVirus uncovered extensive viral diversity, including previously uncharacterized RNA-dependent RNA polymerases (RdRps), thereby expanding the known evolutionary space of RNA viruses. By integrating deep learning with genome-aware open-set discovery, DeepVirus illuminates viral dark matter, providing a foundation for systematic viral taxonomy and advancing exploration of the global virosphere, with broad implications for safeguarding human health.
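The threefold task described above can be illustrated with a small decision routine. The abstract does not specify how DeepVirus makes these decisions, so the following is only a minimal sketch under stated assumptions: the function `assign_lineage`, the three-rank hierarchy, the confidence-threshold rule for open-set rejection, and both threshold values are hypothetical, and the per-rank probabilities are assumed to come from some upstream classifier. The taxon names in the usage example are real but chosen purely for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical subset of taxonomic ranks, ordered from broad to specific.
RANKS = ["realm", "phylum", "family"]


def assign_lineage(
    viral_prob: float,
    rank_probs: Dict[str, Dict[str, float]],
    viral_threshold: float = 0.5,     # assumed value, not from the source
    novelty_threshold: float = 0.7,   # assumed value, not from the source
) -> Tuple[str, List[str]]:
    """Return (status, lineage) for one genome.

    Step 1: reject the genome as non-viral if viral_prob is too low.
    Step 2: walk the ranks top-down, keeping the best label at each rank.
    Step 3: if the best label at some rank is not confident enough, stop
            and flag the genome as a candidate novel group at that rank.
    """
    if viral_prob < viral_threshold:
        return "non-viral", []

    lineage: List[str] = []
    for rank in RANKS:
        probs = rank_probs.get(rank, {})
        if not probs:
            # No prediction available at this rank: stop descending.
            break
        label, confidence = max(probs.items(), key=lambda kv: kv[1])
        if confidence < novelty_threshold:
            # Open-set rejection: no known label fits well at this rank.
            return f"candidate novel {rank}", lineage
        lineage.append(label)
    return "known lineage", lineage


if __name__ == "__main__":
    example = {
        "realm": {"Riboviria": 0.95, "Duplodnaviria": 0.05},
        "phylum": {"Pisuviricota": 0.40, "Kitrinoviricota": 0.35},
    }
    print(assign_lineage(0.9, example))
```

In this sketch a genome confidently placed in a realm but ambiguous at the phylum level is reported as a candidate novel phylum within that realm, which is one simple way to combine hierarchical assignment with open-set recognition. Thresholding the top-class confidence is a common baseline for open-set rejection and is used here only as a stand-in for whatever criterion the authors employ.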