Unsupervised protein language models learn patterns of enzyme function

Abstract

Although enormous amounts of sequence information have become available, assigning a sequence to a particular enzymatic function has remained elusive. Here we describe a framework that steers a general protein language model toward a target reaction without task-specific training, starting from an initial bridgehead protein. At the heart of this framework is PLM-clust, an algorithm that applies k-means clustering to protein language model embeddings to partition sequence space into functional reservoirs in latent space, and then samples from these clusters based on accelerated zero-shot scoring. We demonstrate PLM-clust in a recursive discovery process (with enzyme hit rates quickly rising to >90%), segmenting isofunctional reservoirs and exploring them in greater detail. We exemplify the approach for glycosyl hydrolases (a xylanase, >100-fold activity increase) and for imine reductases (IREDs, >100-fold increase in catalytic promiscuity profiles). It reliably yields novel enzymes that are proficient at the catalytic task at hand, reaching deep into sequence space with a majority of residues exchanged.
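The abstract does not give implementation details for PLM-clust, so the following is only a rough sketch of the general idea it names: cluster embeddings with k-means, locate the cluster that contains the bridgehead protein, and pick the best-scoring members of that cluster. Everything concrete here is an assumption made for illustration: the function names `kmeans` and `sample_from_seed_cluster`, the 2-D toy vectors standing in for real language model embeddings, and the placeholder `scores` dictionary standing in for zero-shot scores. The choice to sample only from the bridgehead's own cluster is likewise a guess, not the authors' stated procedure.

```python
import math
import random


def kmeans(points, k, iters=100, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def sample_from_seed_cluster(ids, embeddings, scores, seed_id, k, n_pick):
    """Return the n_pick best-scoring candidates that share a cluster with seed_id."""
    labels = kmeans([embeddings[i] for i in ids], k)
    seed_label = labels[ids.index(seed_id)]
    candidates = [
        i for i, lab in zip(ids, labels) if lab == seed_label and i != seed_id
    ]
    return sorted(candidates, key=lambda i: scores[i], reverse=True)[:n_pick]


# Toy data (hypothetical): 2-D "embeddings" forming two well-separated groups.
embeddings = {
    "seed": (0.0, 0.0), "a": (0.1, 0.2), "b": (0.2, 0.1), "c": (0.0, 0.3),
    "x": (5.0, 5.0), "y": (5.1, 4.9), "z": (4.9, 5.2),
}
# Placeholder scores (hypothetical); higher is treated as better.
scores = {"seed": 0.0, "a": -1.2, "b": -0.4, "c": -2.0,
          "x": -0.1, "y": -0.3, "z": -0.2}

ids = list(embeddings)
picks = sample_from_seed_cluster(ids, embeddings, scores, "seed", k=2, n_pick=2)
print(picks)
```

In a recursive setting of the kind the abstract describes, one could imagine repeating this step with experimentally confirmed hits as new seeds, though how the authors actually iterate is not stated here.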
