Predicting the Evolutionary and Functional Landscapes of Viruses with a Unified Nucleotide-Protein Language Model: LucaVirus
Abstract
Predicting viral evolution and function remains a central challenge in biology, hindered by high sequence divergence and by how little is known about viruses relative to cellular organisms. Here, we introduce LucaVirus, a multi-modal foundation model for viruses, trained on 25.4 billion nucleotide and amino acid tokens covering nearly all known viruses. LucaVirus learns biologically meaningful representations capturing relationships between sequences, protein/gene homology, and evolutionary divergence. Using these embeddings, we developed downstream models that address key virology tasks: identifying hidden viruses in genomic “dark matter”, annotating enzymatic activities of uncharacterized proteins, predicting viral evolvability, and identifying antibody candidates for emerging viruses. LucaVirus achieves state-of-the-art results in three tasks and matches leading models in the fourth with one-third of the parameters. Together, these findings demonstrate the power of a unified foundation model to comprehensively decode the viral world and establish LucaVirus as an efficient and versatile platform for AI-driven virology, from virus discovery to functional and therapeutic predictions.