GCP-VQVAE: A Geometry-Complete Language for Protein 3D Structure

Abstract

Converting protein tertiary structure into discrete tokens via vector-quantized variational autoencoders (VQ-VAEs) creates a language of 3D geometry and provides a natural interface between sequence and structure models. While pose invariance is commonly enforced, retaining chirality and directional cues without sacrificing reconstruction accuracy remains challenging. In this paper, we introduce GCP-VQVAE, a geometry-complete tokenizer built around a strictly SE(3)-equivariant GCPNet encoder that preserves the orientation and chirality of protein backbones. We vector-quantize rotation- and translation-invariant readouts that retain chirality into a 4,096-token vocabulary, and a transformer decoder maps tokens back to backbone coordinates through a 6D rotation head trained with SE(3)-invariant objectives.
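The two steps named above, nearest-neighbour vector quantization with a straight-through gradient and a 6D rotation head orthogonalized into rotation matrices, follow standard constructions. The sketch below (PyTorch) is a minimal illustration under assumed shapes and names; it is not the authors' implementation, and the commitment/codebook loss terms used to train a VQ-VAE are omitted for brevity.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VectorQuantizer(nn.Module):
        """Nearest-neighbour codebook lookup (names and sizes are assumptions)."""

        def __init__(self, num_codes: int = 4096, dim: int = 256):
            super().__init__()
            self.codebook = nn.Embedding(num_codes, dim)

        def forward(self, z: torch.Tensor):
            # z: (num_residues, dim) invariant per-residue readouts.
            distances = torch.cdist(z, self.codebook.weight)  # (num_residues, num_codes)
            tokens = distances.argmin(dim=-1)                 # discrete token ids
            z_q = self.codebook(tokens)                       # quantized vectors
            # Straight-through estimator: copy gradients from z_q to z.
            z_q = z + (z_q - z).detach()
            return z_q, tokens

    def rotation_6d_to_matrix(x: torch.Tensor) -> torch.Tensor:
        # Map a 6D rotation representation to a proper 3x3 rotation matrix
        # by Gram-Schmidt orthogonalization (Zhou et al., 2019).
        a1, a2 = x[..., :3], x[..., 3:]
        b1 = F.normalize(a1, dim=-1)
        b2 = F.normalize(a2 - (b1 * a2).sum(-1, keepdim=True) * b1, dim=-1)
        b3 = torch.cross(b1, b2, dim=-1)
        return torch.stack([b1, b2, b3], dim=-2)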

Building on these properties, we train GCP-VQVAE on a corpus of 24 million monomer protein backbone structures gathered from the AlphaFold Protein Structure Database. On the CAMEO2024, CASP15, and CASP16 evaluation datasets, the model achieves backbone RMSDs of 0.4377 Å, 0.5293 Å, and 0.7567 Å, respectively, with 100% codebook utilization on a held-out validation set, substantially outperforming prior VQ-VAE-based tokenizers and setting a new state of the art. Beyond these benchmarks, on a zero-shot set of 1,938 completely new experimental structures, GCP-VQVAE attains a backbone RMSD of 0.8193 Å and a TM-score of 0.9673, demonstrating robust generalization to unseen proteins. Lastly, we discuss applications of this foundation-like model, such as protein structure compression and integration with generative protein language models. We make the GCP-VQVAE source code, zero-shot dataset, and pretrained weights fully open to the research community on GitHub.
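For context, backbone RMSD is conventionally reported after optimal rigid superposition of predicted and reference coordinates. The following NumPy sketch computes it with the standard Kabsch algorithm, including the reflection correction that keeps the fitted rotation proper (relevant wherever chirality must be respected); it is a generic illustration, not the authors' evaluation code.

    import numpy as np

    def kabsch_rmsd(P: np.ndarray, Q: np.ndarray) -> float:
        """RMSD between two (N, 3) coordinate sets after optimal rigid alignment."""
        P = P - P.mean(axis=0)                     # remove translation
        Q = Q - Q.mean(axis=0)
        U, S, Vt = np.linalg.svd(P.T @ Q)          # SVD of the covariance matrix
        # Force det(+1) so the fit is a rotation, never a reflection.
        d = np.sign(np.linalg.det(U @ Vt))
        D = np.diag([1.0, 1.0, d])
        P_aligned = P @ U @ D @ Vt                 # rotate P onto Q
        return float(np.sqrt(np.mean(np.sum((P_aligned - Q) ** 2, axis=-1))))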
