RAG-ESM: Improving pretrained protein language models via sequence retrieval

Abstract

Protein language models are significantly advancing the modeling of sequence-function relationships. However, most of them are not directly informed of homology and evolutionary relationships between protein sequences. Here, we propose a method to make them homology-aware. We introduce RAG-ESM, a retrieval-augmented framework that conditions pretrained ESM2 protein language models on homologous sequences, using a minimal number of additional cross-attention parameters and incurring minimal computational cost. We show that RAG-ESM models outperform larger ESM2 models for masked amino acid prediction. We find that sequence alignment capabilities spontaneously emerge in specific cross-attention heads of RAG-ESM. By using a discrete diffusion objective for training, and by conditioning on homologs during inference, RAG-ESM reaches state-of-the-art performance for conditional protein sequence generation and motif scaffolding among sequence-based models. Our method thus possesses strong potential for scalable, efficient and controlled protein engineering.
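To illustrate the idea of conditioning a pretrained encoder on a retrieved homolog through a small number of added cross-attention parameters, the sketch below shows one possible form such a block could take. This is a minimal, hypothetical sketch and not the authors' implementation: the class name, hidden size, and the use of PyTorch's nn.MultiheadAttention are assumptions standing in for the actual RAG-ESM architecture, and the random tensors stand in for frozen ESM2 hidden states.

```python
# Minimal sketch (not the authors' code): condition a query sequence's hidden
# states on those of a retrieved homolog via a lightweight cross-attention
# block. All names and sizes here are illustrative assumptions.
import torch
import torch.nn as nn


class HomologCrossAttention(nn.Module):
    """Residual cross-attention added on top of frozen encoder states.

    Queries come from the query-sequence hidden states; keys and values come
    from the hidden states of a retrieved homologous sequence.
    """

    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, query_states: torch.Tensor,
                homolog_states: torch.Tensor) -> torch.Tensor:
        # The query sequence attends to its homolog; a residual connection
        # keeps the original (pretrained) representation intact.
        attended, _ = self.attn(query_states, homolog_states, homolog_states)
        return self.norm(query_states + attended)


# Toy usage with random tensors standing in for ESM2 hidden states.
d_model = 480                              # e.g. hidden size of a small ESM2 variant
layer = HomologCrossAttention(d_model)
query = torch.randn(1, 120, d_model)       # query sequence, length 120
homolog = torch.randn(1, 135, d_model)     # retrieved homolog, length 135
conditioned = layer(query, homolog)        # same shape as `query`
print(conditioned.shape)                   # torch.Size([1, 120, 480])
```

In a setup like this, only the cross-attention and normalization parameters would be newly trained, which is consistent with the abstract's claim of a minimal number of additional parameters on top of the pretrained model.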