Balancing Locality and Reconstruction in Protein Structure Tokenizer

Read the full article See related articles

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

The structure of a protein is crucial to its biological function. With the expansion of available protein structures, such as those in the AlphaFold Protein Structure Database (AFDB), there is an increasing need for efficient methods to index, search, and generate these structures. Additionally, there is a growing interest in integrating structural information with models from other modalities, such as protein sequence language models. We present a novel VQ-VAE-based protein structure tokenizer, AIDO.StructureTokenizer (AIDO.St), which is a pretrained module for protein structures in an AI-driven Digital Organism [1]. AIDO.StructureTokenizer is a 300M parameter model consisting of an equivariant encoder to discretize input structures into tokens, and an invariant decoder to reconstruct the inputs from these tokens. In addition to evaluating structure reconstruction ability, we also compared our model to Foldseek, ProToken, and ESM3 in terms of protein structure retrieval ability. Through our experiments, we discovered an intriguing trade-off between the encoder’s locality and retrieval ability and the decoder’s reconstruction ability. Our results also demonstrate that a better balance between retrieval and reconstruction enables a better alignment between the structure tokens and a protein sequence language model, resulting in better structure prediction accuracy. Models and code are available through ModelGenerator in https://github.com/genbio-ai/AIDO and on Hugging Face .

Article activity feed