Machine learning-assisted enzyme engineering through ultra-high throughput sorting and large-scale sequence-function data generation
Abstract
Machine learning (ML) shows great promise in protein engineering but has yet to be integrated with ultra-high throughput sorting (ultra-HTS) and next-generation sequencing (NGS) to generate the large-scale sequence-function data needed to explore a wider search space and more complex mutation events. Here, we introduce PUSDA, a framework that rapidly sorts mutant libraries into multiple performance groups with good accuracy and generates large-scale sequence-function data to power ML-driven protein design. As a demonstration, PUSDA generated over five million sequence-function data points for an enzyme, and data processing revealed that over 1.3 million unique enzyme mutants were sorted in a single day. With a trained ML model that achieved 93.52% accuracy, we further analysed combinatorial mutation events and applied a ratio-based selection approach to design novel enzyme sequences. A validation experiment demonstrated a 16.67-fold improvement in the efficiency of identifying high-performance enzymes using PUSDA compared with using ultra-HTS alone. The designed novel enzyme achieved an 8.23-fold increase in productivity compared with the wild type. PUSDA lays a foundation for integrating ultra-HTS, NGS, and ML in future predictive enzyme engineering, offering a data-driven tool for accelerating breakthroughs in biotechnology.
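To make the idea of sequence-function data concrete, the sketch below shows one way sequencing reads from sorted performance groups could be turned into labelled training examples. It is an illustration only and not the PUSDA pipeline: the abstract does not describe the data processing. The group names ("high", "low"), the `min_reads` threshold, the majority-vote rule, and the function name `label_variants` are all assumptions made for this example.

```python
from collections import Counter, defaultdict

def label_variants(reads, min_reads=2):
    """Assign each variant to the performance group where most of its reads fall.

    reads: iterable of (sequence, group_name) pairs, one pair per sequencing read.
    min_reads: hypothetical threshold; variants with fewer reads are skipped.
    Returns {sequence: (majority_group, fraction_of_reads_in_that_group)}.
    """
    counts = defaultdict(Counter)
    for seq, group in reads:
        counts[seq][group] += 1

    labels = {}
    for seq, per_group in counts.items():
        total = sum(per_group.values())
        if total < min_reads:
            continue  # too few reads to trust the assignment
        group, n = per_group.most_common(1)[0]
        labels[seq] = (group, n / total)
    return labels

# Toy reads (made-up sequences, not data from the study).
reads = [
    ("MKTA", "high"), ("MKTA", "high"), ("MKTA", "low"),
    ("MKTV", "low"), ("MKTV", "low"),
    ("MKTL", "high"),  # a single read, so it is filtered out
]
labels = label_variants(reads)
```

The fraction returned with each label gives a rough confidence for the assignment, which a downstream classifier could use to weight or filter examples.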