Genome modeling and design across all domains of life with Evo 2

Abstract

All of life encodes information with DNA. While tools for sequencing, synthesis, and editing of genomic code have transformed biological research, intelligently composing new biological systems would also require a deep understanding of the immense complexity encoded by genomes. We introduce Evo 2, a biological foundation model trained on 9.3 trillion DNA base pairs from a highly curated genomic atlas spanning all domains of life. We train Evo 2 with 7B and 40B parameters to have an unprecedented 1 million token context window with single-nucleotide resolution. Evo 2 learns from DNA sequence alone to accurately predict the functional impacts of genetic variation—from noncoding pathogenic mutations to clinically significant BRCA1 variants—without task-specific finetuning. Applying mechanistic interpretability analyses, we reveal that Evo 2 autonomously learns a breadth of biological features, including exon–intron boundaries, transcription factor binding sites, protein structural elements, and prophage genomic regions. Beyond its predictive capabilities, Evo 2 generates mitochondrial, prokaryotic, and eukaryotic sequences at genome scale with greater naturalness and coherence than previous methods. Guiding Evo 2 via inference-time search enables controllable generation of epigenomic structure, for which we demonstrate the first inference-time scaling results in biology. We make Evo 2 fully open, including model parameters, training code, inference code, and the OpenGenome2 dataset, to accelerate the exploration and design of biological complexity.

Article activity feed

  1. The results for noncoding variants are particularly encouraging. Because such sequences evolve quickly and conservation is less apparent, this seems like a space where models like Evo 2 might actually be pushed into learning more fundamental biological patterns.

  2. as preliminary analysis indicated that most features of interest were represented at this point

    Fig. S5 does not seem to show how layer 26 was selected. It would be helpful to have at least a short description in the Methods of how this layer was chosen. Other work on mechanistic interpretability in protein language models has shown that different types of features can be learned in different layers of the model.

  3. Together, these results highlight the competitive performance of Evo 2 in predicting the pathogenic effects of human coding SNVs

    As an evolutionary geneticist, I find the PhyloP scores the most interesting benchmark here. My concern with models like Evo 2 is always that they may effectively be memorising phylogenetic conservation. That is entirely valid from a biological standpoint, but it can be done with far simpler, phylogenetically explicit methods such as PhyloP or GERP. What is far more exciting is the possibility that a flexible, large model like Evo 2 could pick up on non-linear (e.g., epistatic) patterns, which PhyloP-type methods are blind to. That PhyloP is very competitive in all these tasks is, I think, quite telling: for the most part, the power of these models comes from identifying conservation rather than more general 'biological rules'. That PhyloP can be improved upon in some instances is nonetheless very exciting; in my opinion, this is the gold-standard benchmark to try to beat.

  4. These values were then used as a predictive variable in a logistic regression model of gene essentiality, and directly compared to simple genetic metrics such as GC content and transcript length. Gene age values from the original lncRNA essentiality study (Sarropoulos et al., 2019) were used where available as an additional control.

    Aside from NT, these alternative metrics of lncRNA essentiality seem overly simplistic compared with a model as complex as Evo 2. Are there no other models for lncRNA essentiality? Perhaps an adaptation of sequence conservation methods could work here too.
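
The baseline comparison quoted in item 4 (simple sequence metrics used as predictors of essentiality) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `gc_content` and `auroc`, the toy sequences, and the labels are all hypothetical. The passage describes a logistic regression; for a single predictor with a positive fitted coefficient, the logistic output is a monotonic transform of the raw value, so a rank-based AUROC can be computed directly on the raw predictor, which is what the sketch does.

```python
def gc_content(seq: str) -> float:
    """Fraction of G/C among unambiguous A/C/G/T bases (0.0 if there are none)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5, which matches the Mann-Whitney formulation of AUROC.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: (transcript sequence, essential = 1 / non-essential = 0).
# Real analyses would use the labelled lncRNA set described in the article.
toy = [
    ("GCGCGCAT", 1),
    ("GGCCATAT", 1),
    ("ATATATGC", 0),
    ("ATATATAT", 0),
]
labels = [y for _, y in toy]

# Two simple baseline predictors named in the quoted passage.
gc_scores = [gc_content(seq) for seq, _ in toy]
length_scores = [len(seq) for seq, _ in toy]

gc_auc = auroc(gc_scores, labels)
length_auc = auroc(length_scores, labels)
print(f"GC content AUROC: {gc_auc:.2f}")
print(f"Transcript length AUROC: {length_auc:.2f}")
```

A model-derived score, or a conservation-based score of the kind suggested in the comment, could be passed to the same `auroc` function in place of `gc_scores` to put all predictors on a common footing.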