A Structure-Aware Generative Framework for Exploring Protein Sequence and Function Space
Abstract
The rapid expansion of protein sequence databases has far outpaced experimental structure determination, leaving many sequences unannotated, particularly remote homologs with low sequence identity. Because protein folds are more conserved and more functionally informative than sequences alone, structural information offers a powerful lens for analysis. Here, we introduce a generative, structure-aware framework that integrates geometric encoding and coevolutionary constraints to map, cluster, and design protein sequences. Our approach employs the 3D interaction (3Di) alphabet to convert local residue geometries into compact, 20-state discrete representations. Using ProstT5, we translate bidirectionally between amino acid sequences and 3Di representations, which facilitates sensitive homology detection and structure-guided sequence generation. We construct a latent sequence landscape by combining 3Di-based alignments with direct coupling analysis (DCA) and variational autoencoders (VAEs), unifying tasks such as clustering, annotation, and design. This integrative framework enhances the detection of coevolutionary signals and enables rational sampling of structural variants, even without functional labels. We demonstrate the utility of our method across diverse protein families, including globins, kinases, and malate dehydrogenases, achieving improved contact prediction, homology inference, and sequence generation. Together, our approach offers a quantitative, generative view of protein structure space, advancing studies of protein evolution and design.
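To make the idea of a 20-state discrete representation concrete, the sketch below one-hot encodes aligned strings over a 20-letter alphabet plus a gap symbol and computes per-column state frequencies, the kind of input that DCA and VAE models typically consume. It is an illustration under stated assumptions, not the authors' pipeline: the state letters, the gap character, the function names `one_hot` and `site_frequencies`, and the toy alignment are all hypothetical.

```python
from collections import Counter

# Assumption: 20 state symbols plus a gap character. The letters are
# illustrative placeholders for a 20-state structural alphabet.
STATES = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"
ALPHABET = STATES + GAP
INDEX = {symbol: i for i, symbol in enumerate(ALPHABET)}


def one_hot(aligned_seq):
    """Encode one aligned sequence as a list of 0/1 rows (length x 21)."""
    rows = []
    for ch in aligned_seq.upper():
        row = [0] * len(ALPHABET)
        row[INDEX[ch]] = 1  # raises KeyError on symbols outside ALPHABET
        rows.append(row)
    return rows


def site_frequencies(alignment, pseudocount=0.0):
    """Return per-column state frequencies for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    n_seqs = len(alignment)
    n_states = len(ALPHABET)
    freqs = []
    for col in range(length):
        counts = Counter(seq[col].upper() for seq in alignment)
        denom = n_seqs + pseudocount * n_states
        freqs.append([(counts.get(s, 0) + pseudocount) / denom for s in ALPHABET])
    return freqs


# Hypothetical toy alignment of three 4-column state strings.
toy_alignment = ["ACD-", "ACDA", "AVDA"]
encoded = one_hot(toy_alignment[0])
frequencies = site_frequencies(toy_alignment)
```

Single-site frequencies like these, together with pairwise frequencies, are the statistics from which coupling-based models are usually fitted; the flattened one-hot rows are a common input format for a VAE encoder.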