Balancing Locality and Reconstruction in Protein Structure Tokenizer

Jiayou Zhang
Barthelemy Meynard-Piganeau
James Gong
Xingyi Cheng
Yingtao Luo
Hugo Ly
Le Song
Eric Xing

Abstract

The structure of a protein is crucial to its biological function. With the expansion of available protein structures, such as those in the AlphaFold Protein Structure Database (AFDB), there is an increasing need for efficient methods to index, search, and generate these structures. There is also growing interest in integrating structural information with models from other modalities, such as protein sequence language models. We present a novel VQ-VAE-based protein structure tokenizer, AIDO.StructureTokenizer (AIDO.St), a pretrained module for protein structures in an AI-driven Digital Organism [1]. AIDO.StructureTokenizer is a 300M-parameter model consisting of an equivariant encoder that discretizes input structures into tokens and an invariant decoder that reconstructs the inputs from these tokens. In addition to evaluating structure reconstruction ability, we compare our model to Foldseek, ProToken, and ESM3 in terms of protein structure retrieval ability. Our experiments reveal a trade-off between the encoder’s locality and retrieval ability on one side and the decoder’s reconstruction ability on the other. Our results also demonstrate that a better balance between retrieval and reconstruction enables better alignment between the structure tokens and a protein sequence language model, resulting in better structure prediction accuracy. Models and code are available through ModelGenerator at https://github.com/genbio-ai/AIDO and on Hugging Face.
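
For readers unfamiliar with the tokenization step of a VQ-VAE, the following minimal sketch illustrates the general idea of mapping continuous per-residue embeddings to discrete tokens by nearest-neighbour lookup in a codebook. It is not the AIDO.StructureTokenizer implementation: the function names `quantize` and `lookup`, the 2-dimensional embeddings, and the 3-entry codebook are all hypothetical, and the equivariant encoder and invariant decoder described in the abstract are not modelled here.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def quantize(embeddings, codebook):
    """Return, for each embedding, the index of its nearest codebook vector.

    Hypothetical helper: in a real VQ-VAE the embeddings would come from a
    trained encoder and the codebook would be learned.
    """
    tokens = []
    for vec in embeddings:
        nearest = min(range(len(codebook)), key=lambda k: dist(vec, codebook[k]))
        tokens.append(nearest)
    return tokens


def lookup(tokens, codebook):
    """Map discrete tokens back to codebook vectors (the decoder's input)."""
    return [codebook[t] for t in tokens]


# Toy data (assumed for illustration only): three codes, three residues.
codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
embeddings = [(0.1, -0.1), (0.9, 0.2), (0.2, 0.8)]

tokens = quantize(embeddings, codebook)
quantized = lookup(tokens, codebook)
print(tokens)  # [0, 1, 2]
```

In this toy setting each residue is replaced by a single integer, which is what makes the resulting token sequence usable for indexing and retrieval, or as input to a sequence language model.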

Version published to 10.1101/2024.12.02.626366v2 on bioRxiv
Dec 6, 2024
Version published to 10.1101/2024.12.02.626366v1 on bioRxiv
Dec 5, 2024

Related articles

Rapid and accurate protein structure database search using inverse folding model and contrastive learning

This article has 5 authors:
1. Qiuyi Lyu
2. Hong Wei
3. Shuaishuai Chen
4. Zhenling Peng
5. Jianyi Yang
This article has no evaluations
Latest version May 20, 2025

SoftAlign: End-to-end protein structures alignment

This article has 9 authors:
1. Jeanne Trinquier
2. Samantha Petti
3. Sukhwan Park
4. Kithmini Herath
5. Michel van Kempen
6. Shihao Feng
7. Johannes Söding
8. Martin Steinegger
9. Sergey Ovchinnikov
This article has no evaluations
Latest version May 14, 2025

SSAlign: Ultrafast and Sensitive Protein Structure Search at Scale

This article has 4 authors:
1. Lei Wang
2. Xuchao Zhang
3. Yan Wang
4. Zhidong Xue
This article has no evaluations
Latest version Jul 5, 2025
