Structure-derived synthetic sequences guide a protein language model toward metalloproteins

Giulia Peteani
Gianmattia Sgueglia
Thomas Lemmin
Marco Chino


Abstract

Motivation

Protein language models (pLMs) capture evolutionary sequence constraints but are limited in modeling underrepresented functional classes due to training data imbalance. Metalloproteins constitute a fundamental but sparsely represented class in sequence databases. We therefore assess whether structure-conditioned synthetic sequences can be used to specialize pLMs toward metal-binding functionality.

Results

We fine-tuned the generalist model ProtGPT2 on synthetic sequences generated by the inverse-folding model ProteinMPNN, constructing training sets with controlled variation in size and diversity. Fine-tuning increased recovery of canonical metal-binding motifs from 43% in the baseline model to 91% in the fine-tuned models. Generated sequences retained high predicted structural confidence and structural similarity to known folds, despite low sequence identity. Analysis of latent representations from ProtGPT2 indicated that fine-tuned models occupy distinct regions of embedding space relative to both the baseline model and structure-conditioned sequences, consistent with partial incorporation of structural constraints while preserving sequence diversity. A multi-step filtering pipeline applied to sequences lacking canonical motifs identified candidate metal-binding sites in four-helical bundle topologies not detected in a non-redundant subset of Protein Data Bank structures or in AlphaFold-predicted proteomes.
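The motif-recovery figure above is a fraction: the share of generated sequences containing at least one canonical metal-binding motif. A minimal sketch of such a count follows. The abstract does not list the motifs that were scored or describe the scoring procedure, so the three patterns below (`CxxC`, `ExxH`, `HxxxH`) and the function names `has_motif` and `motif_recovery` are illustrative assumptions, not the authors' method.

```python
import re

# Hypothetical motif patterns, chosen for illustration only; the abstract
# does not specify which motifs were counted. "x" means any residue.
MOTIFS = {
    "CxxC": re.compile(r"C..C"),
    "ExxH": re.compile(r"E..H"),
    "HxxxH": re.compile(r"H...H"),
}


def has_motif(sequence, motifs=MOTIFS):
    """Return True if the sequence contains at least one of the motif patterns."""
    seq = sequence.upper()
    return any(pattern.search(seq) for pattern in motifs.values())


def motif_recovery(sequences, motifs=MOTIFS):
    """Fraction of sequences that contain at least one motif (0.0 if empty)."""
    if not sequences:
        return 0.0
    hits = sum(has_motif(seq, motifs) for seq in sequences)
    return hits / len(sequences)


if __name__ == "__main__":
    # Toy sequences, not data from the preprint.
    toy = ["MKCAACLL", "MKLLVVAA", "AEELHKK", "GGGG"]
    print(f"motif recovery: {motif_recovery(toy):.2f}")
```

Sequence-only pattern matching of this kind says nothing about whether the matched residues are positioned to coordinate a metal, which is presumably why the abstract also reports structural confidence and a separate filtering pipeline.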

Availability and implementation

Code, trained models, and datasets are available at https://doi.org/10.5281/zenodo.18672158 and https://huggingface.co/gsgueglia.

Version published to bioRxiv (10.64898/2026.04.30.722007) on May 5, 2026

