Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus

Abstract

Predicting viral evolution and function remains a central challenge in biology, hindered by high sequence divergence and by limited knowledge of viruses relative to cellular organisms. Here, we introduce LucaVirus, a multi-modal foundation model for viruses, trained on 25.4 billion nucleotide and amino acid tokens covering nearly all known viruses. LucaVirus learns biologically meaningful representations that capture relationships between sequences, protein/gene homology, and evolutionary divergence. Using these embeddings, we developed downstream models that address key virology tasks: identifying hidden viruses in genomic “dark matter”, annotating enzymatic activities of uncharacterized proteins, predicting viral evolvability, and identifying antibody candidates for emerging viruses. LucaVirus achieves state-of-the-art results in three tasks and matches leading models in the fourth with one-third of the parameters. Together, these findings demonstrate the power of a unified foundation model to comprehensively decode the viral world and establish LucaVirus as an efficient and versatile platform for AI-driven virology, from virus discovery to functional and therapeutic predictions.
