Distilling Structural Representations into Protein Sequence Models


Abstract

Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce Implicit Structure Model (ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2's pre-trained model. We have made ISM's structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at https://github.com/jozhang97/ISM .
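The one-line-swap claim suggests ISM ships ESM2-compatible weights. Below is a minimal sketch of what that swap might look like via Hugging Face `transformers`, assuming ISM publishes a drop-in checkpoint; the ISM checkpoint identifier shown is illustrative, not confirmed here, so consult the repository above for the actual weights.

```python
# Minimal sketch: swapping an ESM2 checkpoint for an ISM one.
# The ISM checkpoint name below is a hypothetical placeholder;
# see https://github.com/jozhang97/ISM for the real identifier.
import torch
from transformers import AutoModel, AutoTokenizer

# model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")  # before: plain ESM2
model = AutoModel.from_pretrained("jozhang97/ism_t33_650M")  # after: hypothetical ISM weights
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    # Per-residue, structure-enriched embeddings: (1, seq_len + 2, hidden_dim)
    embeddings = model(**inputs).last_hidden_state
```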

Article activity feed

  1. Structure-tuning is a fine-tuning technique in which a sequence-only model is trained to predict structure tokens, rather than masked amino acids, for each protein residue (a minimal sketch of this objective follows below).

    Is this technique novel? It seems like a promising approach for distilling other features that can be reasonably well predicted from sequence alone. Are there any plans to do that?
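    To make the objective described above concrete, here is a minimal sketch of structure-tuning as a per-residue classification loss, assuming each residue carries a discrete token from the paper's microenvironment autoencoder. The names `StructureTokenHead` and `NUM_STRUCTURE_TOKENS` are illustrative, not taken from the ISM codebase.

    ```python
    # Sketch of structure-tuning: cross-entropy against precomputed
    # per-residue structure tokens instead of masked amino acids.
    import torch
    import torch.nn as nn

    NUM_STRUCTURE_TOKENS = 4096  # assumed codebook size of the structure autoencoder

    class StructureTokenHead(nn.Module):
        """Linear head mapping sequence-model embeddings to structure-token logits."""
        def __init__(self, hidden_dim: int):
            super().__init__()
            self.proj = nn.Linear(hidden_dim, NUM_STRUCTURE_TOKENS)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, hidden_dim) -> (batch, seq_len, NUM_STRUCTURE_TOKENS)
            return self.proj(hidden_states)

    def structure_tuning_loss(hidden_states, structure_tokens, head):
        """Per-residue cross-entropy against tokens from the structure autoencoder."""
        logits = head(hidden_states)
        return nn.functional.cross_entropy(
            logits.flatten(0, 1),        # (batch * seq_len, NUM_STRUCTURE_TOKENS)
            structure_tokens.flatten(),  # (batch * seq_len,)
            ignore_index=-100,           # skip residues without structure labels
        )
    ```

    Under this framing, the answer to the generalization question is plausibly yes: any per-residue label that is predictable from sequence could, in principle, be swapped in for `structure_tokens` with the same loss.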