Distilling Structural Representations into Protein Sequence Models
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show substantial performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce the Implicit Structure Model (ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks, including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2's pre-trained model. We have made ISM's structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at https://github.com/jozhang97/ISM.
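Since ISM is positioned as a drop-in replacement for ESM2, the single-line change amounts to swapping the checkpoint name passed to the model loader. The sketch below assumes HuggingFace `transformers` and an illustrative ISM checkpoint identifier (`jozhang97/ism_t33_650M_uc30pdb` is an assumption; consult the repository for the released names):

```python
from transformers import AutoModel, AutoTokenizer

# ESM2 baseline would be:
#   model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")
# The one-line change is swapping in ISM's structure-enriched weights.
# The ISM checkpoint name below is an assumption; see the repository.
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")
model = AutoModel.from_pretrained("jozhang97/ism_t33_650M_uc30pdb")

# Per-residue embeddings are extracted exactly as with ESM2.
inputs = tokenizer("MKTVRQERLKSIVRILERSKEPVSGAQ", return_tensors="pt")
embeddings = model(**inputs).last_hidden_state  # (batch, seq_len, hidden_dim)
```

Because ISM keeps ESM2's architecture and vocabulary, downstream code that consumes ESM2 embeddings should run unchanged on the swapped model.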
Article activity feed
- "Structure-tuning is a fine-tuning technique where a sequence-only model is trained to predict structure tokens – rather than masked amino acids – for each protein residue."

  Is this technique novel? This seems like a good approach for adding in other features that can be predicted reasonably well from sequence alone. Are there any plans to do that?
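To make the quoted distinction concrete, here is a minimal, hypothetical sketch of a structure-tuning objective: a classification head over the sequence model's per-residue embeddings is trained with cross-entropy against precomputed structure tokens. The head, the codebook size, and all names are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical structure-tuning head (names and sizes are illustrative).
# A linear classifier over per-residue embeddings predicts a discrete
# structure token for every position; training minimizes cross-entropy
# against tokens precomputed by the structure autoencoder.
hidden_dim = 1280            # e.g., ESM2-650M embedding width
num_structure_tokens = 4096  # assumed codebook size of the autoencoder

structure_head = nn.Linear(hidden_dim, num_structure_tokens)
loss_fn = nn.CrossEntropyLoss()

def structure_tuning_loss(residue_embeddings: torch.Tensor,
                          structure_tokens: torch.Tensor) -> torch.Tensor:
    """residue_embeddings: (batch, length, hidden_dim) from the sequence model.
    structure_tokens: (batch, length) integer labels computed once from
    known 3D structures."""
    logits = structure_head(residue_embeddings)  # (batch, length, num_tokens)
    return loss_fn(logits.reshape(-1, num_structure_tokens),
                   structure_tokens.reshape(-1))
```

In this framing, any per-residue label that can be precomputed (the "other features" the comment asks about) could be substituted for the structure tokens without changing the training loop.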