Distilling Structural Representations into Protein Sequence Models


Abstract

Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduce Implicit Structure Model (ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2's pre-trained model. We have made ISM's structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available at https://github.com/jozhang97/ISM .
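The one-line-swap claim suggests ISM ships ESM2-compatible weights. Below is a minimal sketch of what that swap might look like via Hugging Face `transformers`, assuming ISM publishes a drop-in checkpoint; the ISM checkpoint identifier shown is illustrative, not confirmed here, so consult the repository above for the actual weights.

```python
# Minimal sketch: swapping an ESM2 checkpoint for an ISM one.
# The ISM checkpoint name below is a hypothetical placeholder;
# see https://github.com/jozhang97/ISM for the real identifier.
import torch
from transformers import AutoModel, AutoTokenizer

# model = AutoModel.from_pretrained("facebook/esm2_t33_650M_UR50D")  # before: plain ESM2
model = AutoModel.from_pretrained("jozhang97/ism_t33_650M")  # after: hypothetical ISM weights
tokenizer = AutoTokenizer.from_pretrained("facebook/esm2_t33_650M_UR50D")

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
inputs = tokenizer(sequence, return_tensors="pt")
with torch.no_grad():
    # Per-residue, structure-enriched embeddings: (1, seq_len + 2, hidden_dim)
    embeddings = model(**inputs).last_hidden_state
```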

Article activity feed

  1. Structure-tuning is a fine-tuning technique in which a sequence-only model is trained to predict structure tokens, rather than masked amino acids, for each protein residue (a minimal sketch of this objective follows below).

    Is this technique novel? It seems like a promising approach for distilling other features that can be reasonably well predicted from sequence alone. Are there any plans to do that?
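    To make the objective described above concrete, here is a minimal sketch of structure-tuning as a per-residue classification loss, assuming each residue carries a discrete token from the paper's microenvironment autoencoder. The names `StructureTokenHead` and `NUM_STRUCTURE_TOKENS` are illustrative, not taken from the ISM codebase.

    ```python
    # Sketch of structure-tuning: cross-entropy against precomputed
    # per-residue structure tokens instead of masked amino acids.
    import torch
    import torch.nn as nn

    NUM_STRUCTURE_TOKENS = 4096  # assumed codebook size of the structure autoencoder

    class StructureTokenHead(nn.Module):
        """Linear head mapping sequence-model embeddings to structure-token logits."""
        def __init__(self, hidden_dim: int):
            super().__init__()
            self.proj = nn.Linear(hidden_dim, NUM_STRUCTURE_TOKENS)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, hidden_dim) -> (batch, seq_len, NUM_STRUCTURE_TOKENS)
            return self.proj(hidden_states)

    def structure_tuning_loss(hidden_states, structure_tokens, head):
        """Per-residue cross-entropy against tokens from the structure autoencoder."""
        logits = head(hidden_states)
        return nn.functional.cross_entropy(
            logits.flatten(0, 1),        # (batch * seq_len, NUM_STRUCTURE_TOKENS)
            structure_tokens.flatten(),  # (batch * seq_len,)
            ignore_index=-100,           # skip residues without structure labels
        )
    ```

    Under this framing, the answer to the generalization question is plausibly yes: any per-residue label that is predictable from sequence could, in principle, be swapped in for `structure_tokens` with the same loss.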