Mind the Gap: An Embedding Guide to Safely Travel in Sequence Space

Adam Wu
Quentin Trolliet
Abhinav Rajendran
Jakub Lála
Stefano Angioletti-Uberti

Abstract

We present a hybrid approach that combines a protein language model (pLM) with Monte Carlo (MC) sampling to generate enzyme mutants free of mutations that are deleterious to structural preservation. Given the amino acid sequence of the original enzyme and a set of residues whose local environment should be conserved, i.e., the catalytic site, our approach generates mutants that differ vastly in overall sequence while retaining the geometry of the conserved region, making them promising candidates for further experimental screening. In contrast to end-to-end deep learning approaches, whose results are harder to interpret and control, a well-established, classic technique such as MC sampling lets us interpret the generative process as the sampling of an energy landscape determined by the pLM. In turn, this interpretation enables us to steer the generative process and control its outcome using robust statistical mechanics concepts, e.g., temperature, thereby explicitly guaranteeing certain properties of the generated mutants. Given the increasing relevance of generative algorithms in the design of and search for novel, optimised enzymes, we believe that our results constitute an important step for the future development of this class of techniques. To facilitate experimental verification, we finally provide hundreds of sequences for 13 different enzymes involved in catalytic processes ranging from carbon dioxide conversion to DNA replication.
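
To make the idea of sampling a pLM-determined energy landscape at a given temperature concrete, the following is a minimal Metropolis Monte Carlo sketch, not the authors' implementation. Everything named here is an assumption for illustration: `toy_energy` is a hypothetical stand-in for a pLM-derived energy (it simply counts mismatches to the reference sequence), the single-residue substitution move and the acceptance rule are the textbook Metropolis choices, and the example sequence and conserved positions are made up.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(seq, reference):
    """Hypothetical stand-in for a pLM-derived energy.

    Counts mismatches to the reference; a real setup would score the
    sequence with a protein language model instead (assumption).
    """
    return float(sum(a != b for a, b in zip(seq, reference)))


def metropolis_sample(reference, conserved, energy_fn, temperature, n_steps, seed=0):
    """Sample a mutant with Metropolis MC, never touching conserved positions."""
    rng = random.Random(seed)
    seq = list(reference)
    mutable = [i for i in range(len(seq)) if i not in conserved]
    if not mutable:
        raise ValueError("no mutable positions left to sample")
    energy = energy_fn("".join(seq))
    for _ in range(n_steps):
        # Propose a single-residue substitution at a non-conserved position.
        pos = rng.choice(mutable)
        old = seq[pos]
        seq[pos] = rng.choice(AMINO_ACIDS)
        new_energy = energy_fn("".join(seq))
        delta = new_energy - energy
        # Metropolis rule: always accept downhill moves, accept uphill
        # moves with probability exp(-delta / temperature).
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            energy = new_energy
        else:
            seq[pos] = old  # reject: restore the previous residue
    return "".join(seq), energy


if __name__ == "__main__":
    reference = "MKTAYIAKQRQISFVKSHFSRQ"  # made-up example sequence
    conserved = {0, 5, 10}                # made-up conserved positions
    energy_fn = lambda s: toy_energy(s, reference)
    for temperature in (0.1, 1.0, 5.0):
        mutant, energy = metropolis_sample(
            reference, conserved, energy_fn, temperature, n_steps=500
        )
        print(temperature, mutant, energy)
```

In this sketch the temperature is the only control knob: at low temperature uphill moves are almost never accepted and the chain stays close to the starting sequence, while higher temperatures let it wander further in sequence space. The conserved positions are excluded from the move set, so they are preserved by construction.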

Version published to 10.1101/2025.06.19.660524 on bioRxiv
Jun 20, 2025

Related articles

GENERator: A Long-Context Generative Genomic Foundation Model

This article has 18 authors:
1. Qiuyi Li
2. Wei Wu
3. Yuanyuan Zhang
4. Zhihao Zhan
5. Ruipu Chen
6. Mingyang Li
7. Kun Fu
8. Junyan Qi
9. Yongzhou Bao
10. Chao Wang
11. Yiheng Zhu
12. Zhiyun Zhang
13. Jian Tang
14. Fuli Feng
15. Jieping Ye
16. Liu Yuwen
17. Hui Xiong
18. Zheng Wang
This article has no evaluations
Latest version Feb 4, 2026

In-Context Learning in Genomic Language Models as a Biological Evaluation Task

This article has 2 authors:
1. Aadit Kapoor
2. Wendy Lee
This article has no evaluations
Latest version Dec 9, 2025

Emergence of Biological Structural Discovery in General-Purpose Language Models

This article has 1 author:
1. Liang Wang
This article has no evaluations
Latest version Jan 8, 2026
