From Atoms to Fragments: A Coarse Representation for Efficient and Functional Protein Design
Abstract
Motivation
Although deep learning has accelerated protein design, current protein representations such as sequences or full-atom structures scale non-linearly with protein length. We propose a sparse and interpretable representation for proteins based on evolutionarily conserved fragments. Specifically, we use a curated set of 40 functional and evolutionarily conserved fragments as an alphabet to build Fragment Graphs and Fragment Sets. These fragment-based representations are both lightweight and functionally informative, capturing up to 55% more variance while using only a fraction of the dimensions required by traditional methods.
Results
On a dataset of 215 functionally diverse proteins, our approach creates more coherent functional clusters than traditional sequence- and structure-based methods, even among proteins with ≤ 30% sequence identity. Fragment-based searches of protein databases achieve accuracies comparable to traditional methods, while using 90% fewer tokens per protein. These searches execute ∼68.7× faster than RMSD-based structural methods and ∼1.64× faster than sequence-based methods, even including fragment pre-processing overhead. Additionally, we show that our representation effectively guides RFDiffusion for protein backbone generation with functional recovery rates higher than 40%. In summary, our fragment-based representation offers a scalable and interpretable alternative for the next generation of protein design tools for backbone design, sequence design, and functional similarity searches within protein structure databases.
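As an illustration only, a fragment-based search can be sketched by treating each protein as a bag of fragment labels drawn from the 40-fragment alphabet and ranking database entries by vector similarity. The count-vector encoding, the cosine score, the integer fragment IDs, and the function names below are assumptions made for this sketch; they are not taken from the tessera code base, and the abstract does not specify how Fragment Sets are encoded or compared.

```python
from collections import Counter
from math import sqrt

ALPHABET_SIZE = 40  # the abstract states a curated alphabet of 40 fragments


def fragment_set_vector(fragment_ids):
    """Count how often each fragment ID (0..39, hypothetical labels) occurs."""
    counts = Counter(fragment_ids)
    return [counts.get(i, 0) for i in range(ALPHABET_SIZE)]


def cosine_similarity(u, v):
    """Cosine similarity of two count vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def search(query_ids, database):
    """Rank database entries (name -> fragment IDs) by similarity to the query."""
    query_vec = fragment_set_vector(query_ids)
    scored = [
        (name, cosine_similarity(query_vec, fragment_set_vector(ids)))
        for name, ids in database.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy database with made-up fragment assignments (not real proteins).
database = {
    "protein_a": [3, 3, 17, 22],
    "protein_b": [5, 9, 31],
    "protein_c": [3, 17, 22, 22],
}
ranking = search([3, 17, 22], database)
```

Each protein here is described by a handful of integers rather than by hundreds of residues or thousands of atomic coordinates, which is the kind of compression the token-count comparison above refers to. A Fragment Graph would additionally record relationships between fragments, which this sketch omits.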
Availability
https://github.com/wells-wood-research/tessera (Documentation to be made available upon acceptance)