From Atoms to Fragments: A Coarse Representation for Functional and Efficient Protein Design
Abstract
Deep learning has made remarkable progress in protein design, yet current protein representations remain largely black-box and scale poorly with protein length, leading to high computational costs. We propose a fragment-based protein representation that balances interpretability and efficiency. Using a curated set of 40 evolutionarily conserved fragments, we represent proteins as fragment sets or fragment graphs, significantly reducing dimensionality while preserving functional information. We show that fragment-based representations capture significantly more information at much lower dimensionality than traditional methods. On a dataset of 215 functionally diverse proteins, our approach outperforms traditional sequence- and structure-based methods in clustering by protein function at ≤ 30% sequence identity. Additionally, fragment-based search achieves comparable accuracy while using 90% fewer tokens. Fragment-based search also runs ∼68.7× faster than RMSD-based methods and ∼1.64× faster than sequence-based methods, even when fragment pre-processing overhead is included. Finally, we show that fragments can guide RFDiffusion backbone generation, with recovery rates above 40%. We propose fragment-based representations as a scalable and interpretable alternative for the next generation of protein design tools, spanning backbone and sequence design to functional searches in protein structure databases.
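To make the fragment-set idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each protein has already been annotated with the indices (0 to 39) of the library fragments it contains, and it uses Jaccard similarity as a stand-in scoring function; the abstract does not specify the actual similarity measure. The names `to_vector`, `jaccard`, `search`, and the toy proteins are all hypothetical.

```python
from typing import Dict, FrozenSet, List, Tuple

# Size of the curated fragment library, as stated in the abstract.
N_FRAGMENTS = 40


def to_vector(fragments: FrozenSet[int]) -> List[int]:
    """Encode a fragment set as a fixed-length binary vector.

    The vector length depends only on the library size, not on protein length.
    """
    return [1 if i in fragments else 0 for i in range(N_FRAGMENTS)]


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity between two fragment sets (assumed metric)."""
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def search(
    query: FrozenSet[int],
    database: Dict[str, FrozenSet[int]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank database proteins by fragment-set similarity to the query."""
    scored = [(name, jaccard(query, frags)) for name, frags in database.items()]
    # Sort by descending score, breaking ties alphabetically by name.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


# Toy data: fragment indices are made up for illustration only.
database = {
    "protein_A": frozenset({0, 5, 12}),
    "protein_B": frozenset({5, 12, 33}),
    "protein_C": frozenset({7, 21}),
}
query = frozenset({5, 12})
hits = search(query, database, top_k=2)
```

Because every protein maps to a vector of at most 40 entries, comparison cost in this sketch is independent of sequence length, which is the intuition behind the reported token and runtime savings.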