Rapid protein evolution by few-shot learning with a protein language model

Abstract

Directed evolution of proteins is critical for applications in basic biological research, therapeutics, diagnostics, and sustainability. However, directed evolution methods are labor intensive, cannot efficiently optimize over multiple protein properties, and are often trapped by local maxima. In silico directed evolution methods incorporating protein language models (PLMs) have the potential to accelerate this engineering process, but current approaches fail to generalize across diverse protein families. We introduce EVOLVEpro, a few-shot active learning framework to rapidly improve protein activity using a combination of PLMs and protein activity predictors, achieving improved activity with as few as four rounds of evolution. EVOLVEpro substantially enhances the efficiency and effectiveness of in silico protein evolution, surpassing current state-of-the-art methods and yielding proteins with up to 100-fold improvement of desired properties. We showcase EVOLVEpro for five proteins across three applications: T7 RNA polymerase for RNA production; a miniature CRISPR nuclease, a prime editor, and an integrase for genome editing; and a monoclonal antibody for epitope binding. These results demonstrate the advantages of few-shot active learning with small amounts of experimental data over zero-shot predictions. EVOLVEpro paves the way for broader applications of AI-guided protein engineering in biology and medicine.
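
To make the few-shot loop concrete, below is a minimal sketch of one round of active learning of the kind the abstract describes: PLM embeddings feed a small activity predictor trained on a handful of measured mutants, and the predictor ranks unmeasured mutants for the next experimental round. The embedding function (`embed_sequences`), the random-forest top model, and the batch size are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def embed_sequences(seqs):
    """Hypothetical placeholder for a PLM embedder (e.g., mean-pooled ESM
    residue embeddings); returns one fixed-length vector per sequence."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(seqs), 1280))


def propose_next_round(measured, candidate_mutants, batch_size=12):
    """Fit an activity predictor on the few measured variants and rank the
    unmeasured mutants for the next wet-lab round."""
    X_train = embed_sequences(list(measured.keys()))
    y_train = np.array(list(measured.values()))
    predictor = RandomForestRegressor(n_estimators=200, random_state=0)
    predictor.fit(X_train, y_train)
    scores = predictor.predict(embed_sequences(candidate_mutants))
    ranked = sorted(zip(candidate_mutants, scores), key=lambda t: t[1], reverse=True)
    return [seq for seq, _ in ranked[:batch_size]]


# Iterate: measure the proposed batch experimentally, add the results to
# `measured`, and repeat for a few rounds of evolution.
```

In practice the embedding call would be replaced by a real protein language model, and the predictor could be any small regressor suited to a few dozen labeled examples; the essential point is the loop of predict, measure, and retrain.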

Article activity feed

  1. A schematic showing the evolution of higher-activity variants with EVOLVEpro. The mutagenesis landscape of a protein is often conceptualized as a complex terrain with numerous possible paths; here it is depicted as a gray road, where moving upward corresponds to higher protein activity and moving downward to reduced fitness. Traditional frameworks of evolutionary plausibility attempt to navigate this terrain via natural selection, which is constrained by historical and environmental factors.

    In the manuscript, "fitness" generally refers to landscapes learned by pLMs, but at other times it describes the actual landscapes traversed by evolution (via processes like natural selection). Given the limitations of pLMs, including those you cover in the introduction, it feels dangerous to conflate the two. It is far from established that language models are able to infer the true structure of evolutionary processes, much less model the complex activities of natural selection.

    This feels important to note since the discontinuity between fitness and trait distributions has been recognized for a long time (e.g., Fisher 1930). Many factors contribute to this relationship, both at the level of individual genes/proteins and at the level of genetic interactions. It is likely that variation in the relationship between pLM-inferred fitness and measured activity will also be shaped by multiple such factors (as evidenced by the differences observed even here across the five proteins of focus), and that these factors will at least somewhat differ from those shaping empirical fitness landscapes. Clearly delineating these differences seems useful for future model development and refinement.