MillionFull enables low-cost, massive, full-length enzyme sequence–fitness data collection for machine learning–guided enzyme engineering

Jinbei Li
Bjarke Erichsen
Simon R. Krarup
Sonia C. Yuan
Kenan Jijakli
Søren Karst
Lei Yang
Alex Toftgaard Nielsen

Abstract

Machine learning holds great promise for accelerating enzyme optimization, but its power is fundamentally constrained by the limited availability of sequence–fitness data. Here, we introduce MillionFull, a low-cost method that enables high-throughput full-length sequence–fitness mapping for enzymes of arbitrary length. Each run yields on the order of 10⁵–10⁷ data points, capturing sequence–function relationships at unprecedented scale. By overcoming the data bottleneck, MillionFull provides a foundation for dramatically advancing AI-driven enzyme engineering.

Version published to 10.1101/2025.10.24.684421 on bioRxiv
Oct 25, 2025

Related articles

Artificial Intelligence–Driven Structural Mining Enables Functional Inference in the Human Dark Proteome

This article has 7 authors:
1. Valentina Carbonari
2. Annamaria Defilippo
3. Ugo Lomoio
4. Caterina Francesca Perri
5. Barbara Puccio
6. Pierangelo Veltri
7. Pietro Hiram Guzzi
This article has no evaluations. Latest version: Dec 23, 2025

A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluations. Latest version: Dec 24, 2025

Unlocking the genomic landscape for antimicrobial domain discovery with a two-stage progressive residue-level annotation model

This article has 13 authors:
1. Peilin Xie
2. Xingchen Liu
3. Lantian Yao
4. Zhihao Zhao
5. Anming Yang
6. Jiahui Guan
7. Zijun Jiao
8. Zhihong Liu
9. Junwen Wang
10. Tzong-Yi Lee
11. Zigang Li
12. Bingyu Cui
13. Ying-Chih Chiang
This article has no evaluations. Latest version: Dec 11, 2025
