From Atoms to Fragments: A Coarse Representation for Efficient and Functional Protein Design

Leonardo V. Castorina
Christopher W. Wood
Kartic Subr

Abstract

Motivation

Although deep learning has accelerated protein design, current protein representations such as sequences or full-atom structures scale non-linearly with protein length. We propose a sparse and interpretable representation for proteins, based on evolutionarily conserved fragments. Specifically, we use a curated set of 40 functional and evolutionarily conserved fragments as an alphabet to build Fragment Graphs and Fragment Sets. These fragment-based representations are both lightweight and functionally informative, capturing up to 55% more variance using fewer than of the dimensions required by traditional methods.

Results

On a dataset of 215 functionally diverse proteins, our approach creates more coherent functional clusters than traditional sequence- and structure-based methods, even among proteins with ≤ 30% sequence identity. Fragment-based searches of protein databases achieve accuracies comparable to traditional methods, while using 90% fewer tokens per protein. These searches execute ∼68.7× faster than RMSD-based structural methods and ∼1.64× faster than sequence-based methods, even including fragment pre-processing overhead. Additionally, we show that our representation effectively guides RFDiffusion for protein backbone generation with functional recovery rates higher than 40%. In summary, our fragment-based representation offers a scalable and interpretable alternative for the next generation of protein design tools for backbone design, sequence design, and functional similarity searches within protein structure databases.
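
As a rough illustration of what a fragment-based database search could look like, the sketch below treats each protein as a collection of integer fragment IDs drawn from a 40-symbol alphabet and ranks database entries by Jaccard similarity of their fragment sets. This is not the authors' implementation: the ID encoding, the choice of Jaccard similarity, and the names `fragment_vector`, `jaccard`, and `search` are all assumptions made for this example, and the abstract does not specify how fragments are detected or scored.

```python
from collections import Counter

ALPHABET_SIZE = 40  # the abstract states a curated alphabet of 40 fragments


def fragment_vector(fragment_ids):
    """Hypothetical encoding: count how often each fragment ID (0..39) occurs."""
    for fid in fragment_ids:
        if not 0 <= fid < ALPHABET_SIZE:
            raise ValueError(f"fragment id {fid} is outside the alphabet")
    counts = Counter(fragment_ids)
    return [counts.get(i, 0) for i in range(ALPHABET_SIZE)]


def jaccard(a, b):
    """Jaccard similarity between two fragment sets (an assumed metric)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty sets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


def search(query, database, top_k=3):
    """Rank database proteins by fragment-set similarity to the query."""
    scored = [(name, jaccard(query, frags)) for name, frags in database.items()]
    # Highest similarity first; ties are broken alphabetically by name.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]


if __name__ == "__main__":
    # Toy data invented for illustration only.
    toy_db = {
        "protein_a": [1, 2, 3, 7],
        "protein_b": [2, 3, 4],
        "protein_c": [10, 11, 12],
    }
    for name, score in search([1, 2, 3], toy_db):
        print(f"{name}: {score:.2f}")
```

Because each protein reduces to a handful of small integers, a comparison of this kind costs a few set operations per pair, which is the sort of saving the token and runtime figures above point to.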

Availability

https://github.com/wells-wood-research/tessera (Documentation to be made available upon acceptance)

Version published to 10.1101/2025.03.19.644162 on bioRxiv
Mar 19, 2025
