Machine learning-assisted enzyme engineering through ultra-high throughput sorting and large-scale sequence-function data generation

Jingyun Zhang
Sangeetha Shanmugam
Jing Wui Yeoh
Dan Zheng
Jan Ron Goh
Zhangyuan Lin
Chueh Loo Poh

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

Machine learning (ML) shows great promise in protein engineering, but it has yet to be integrated with ultra-high throughput sorting (ultra-HTS) and next-generation sequencing (NGS) for large-scale sequence-function data generation, which would let it explore a wider search space and more complex mutation events. Here, we introduce PUSDA, a framework that rapidly sorts mutant libraries into multiple performance groups with good accuracy and generates large-scale sequence-function data to power ML-driven protein design. As a demonstration, PUSDA generated over five million sequence-function data points for an enzyme, with data processing revealing that over 1.3 million unique enzyme mutants were sorted in a single day. With a trained ML model that achieved 93.52% accuracy, we further analysed combinatorial mutation events and applied a ratio-based selection approach to design novel enzyme sequences. A validation experiment demonstrated a 16.67-fold improvement in the efficiency of identifying high-performance enzymes using PUSDA compared to using ultra-HTS alone. The designed novel enzyme achieved an 8.23-fold increase in productivity compared to the wild type. PUSDA lays a foundation for integrating ultra-HTS, NGS, and ML in future predictive enzyme engineering, offering a data-driven tool for accelerating breakthroughs in biotechnology.

Arcadia Science
Apr 11, 2025

While the study showed that there is good predictive accuracy by the ML between high and low-performance mutants, the prediction between high and medium-performance groups was less accurate, likely due to less distinct sequence differences between these groups.

Do you think this will be consistent across proteins or do you think that some of this is protein-specific? I'm interested to see this method applied to new proteins in the future!

Arcadia Science
Apr 11, 2025

Low (L, < 6 µM)

Does this also include cells where no apigenin was produced? I'm curious about how cases like that, or cases where the variant is totally non-functional, factor into this analysis.

Arcadia Science
Apr 11, 2025

Mutations for H group clustered at AA89, AA160-AA174, AA189, AA203, and AA206. In the M group, most of the mutations were located at AA160-AA174, AA189, and AA203. In the L group, most mutations were observed at AA89, AA120, AA138, AA159-AA180, AA188, and AA205.

Interesting that there's quite a bit of overlap between all of these! Does the grouping of mutations to particular regions of the protein have to do with the strategy used to generate variants?
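
The overlap noted here can be made explicit with a small set comparison. The sketch below is illustrative only: it uses the positions quoted from the preprint and assumes that each quoted range (for example AA160-AA174) covers every residue in that range, which the preprint excerpt does not confirm. The helper name `expand` is hypothetical.

```python
def expand(spec):
    """Expand ints and inclusive (start, end) tuples into a set of positions.

    Assumption: a quoted range such as AA160-AA174 includes every residue
    between its endpoints.
    """
    positions = set()
    for item in spec:
        if isinstance(item, tuple):
            start, end = item
            positions.update(range(start, end + 1))
        else:
            positions.add(item)
    return positions


# Positions as quoted from the preprint for each performance group.
groups = {
    "H": expand([89, (160, 174), 189, 203, 206]),
    "M": expand([(160, 174), 189, 203]),
    "L": expand([89, 120, 138, (159, 180), 188, 205]),
}

# Positions reported in all three groups.
shared_by_all = groups["H"] & groups["M"] & groups["L"]

# Positions reported in exactly one group.
h_only = groups["H"] - groups["M"] - groups["L"]
m_only = groups["M"] - groups["H"] - groups["L"]
l_only = groups["L"] - groups["H"] - groups["M"]

print("Shared by H, M, and L:", sorted(shared_by_all))
print("H only:", sorted(h_only))
print("M only:", sorted(m_only))
print("L only:", sorted(l_only))
```

Under that assumption, the AA160-AA174 stretch is common to all three groups, while only a few positions are specific to a single group. This says nothing about which substitutions occur at each position, so shared positions do not imply shared mutations.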

Arcadia Science
Apr 11, 2025

single mutation event occurrences

Are all of your variants single amino acid substitutions? Do you have variants that have multiple substitutions?

Version published to 10.1101/2025.03.30.645636 on bioRxiv
Apr 1, 2025
