Transformer-based deep learning for multiclass viral classification in metagenomic sequencing

Abstract

Identifying viruses within metagenomic data remains a central challenge in computational biology. Given the ongoing threat of infectious diseases, underscored by the COVID-19 pandemic, early and accurate pathogen detection is critical. Alignment-based methods such as BLAST suffer from high computational cost and a limited ability to detect highly divergent or novel viral genomes. While recent machine learning approaches have improved detection speed and sensitivity, most remain restricted to binary classification, distinguishing viral from non-viral sequences without finer taxonomic resolution. Convolutional neural networks (CNNs) enable multiclass prediction but struggle to model long-range genomic dependencies. We present a hybrid architecture that integrates a convolutional preprocessing stage with transformer encoder layers to capture global contextual relationships more efficiently. Trained on clustered viral genomes from NCBI and evaluated with accuracy and F1-score, the model achieved 72% accuracy and an F1-score of 0.69, surpassing the CNN-based VirDetect-AI baseline. These results demonstrate that transformer-based architectures can generalize across thousands of viral classes, offering a scalable framework for multiclass viral classification. This approach advances metagenomic analysis by enabling rapid, fine-grained identification of viral diversity in environmental samples.
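For readers unfamiliar with the metrics reported above, the following is a minimal plain-Python sketch of accuracy and macro-averaged F1 for a multiclass setting (the abstract does not specify the averaging scheme, so macro averaging is an assumption here, and the toy labels are invented for illustration, not the paper's data):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that exactly match the true class label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then take the unweighted mean.

    Treats every class equally, which matters when thousands of viral
    classes have very different support.
    """
    classes = set(y_true) | set(y_pred)
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        f1_scores.append(f1)
    return sum(f1_scores) / len(f1_scores)

# Toy example with three classes (hypothetical labels):
y_true = [0, 0, 1, 2]
y_pred = [0, 1, 1, 2]
print(accuracy(y_true, y_pred))  # 0.75
print(macro_f1(y_true, y_pred))
```

In practice a library implementation such as scikit-learn's `f1_score(..., average="macro")` would be used, but the arithmetic is exactly what is shown here.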
