Machine learning-assisted enzyme engineering through ultra-high throughput sorting and large-scale sequence-function data generation
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Machine learning (ML) shows great promise in protein engineering but has yet to be integrated with ultra-high throughput sorting (ultra-HTS) and NGS for large-scale sequence-function data generation, which would let it explore a wider search space and more complex mutation events. Here, we introduce PUSDA, a framework that rapidly sorts mutant libraries into multiple performance groups with good accuracy and generates large-scale sequence-function data to power ML-driven protein design. As a demonstration, PUSDA generated over five million sequence-function data points for an enzyme, and data processing revealed that over 1.3 million unique enzyme mutants were sorted in a single day. With a trained ML model that achieved 93.52% accuracy, we further analysed combinatorial mutation events and applied a ratio-based selection approach to design novel enzyme sequences. A validation experiment demonstrated a 16.67-fold improvement in the efficiency of identifying high-performance enzymes using PUSDA compared to using ultra-HTS alone. The designed novel enzyme achieved an 8.23-fold increase in productivity compared to the wild type. PUSDA lays a foundation for integrating ultra-HTS, NGS, and ML for future predictive enzyme engineering, offering a data-driven tool for accelerating breakthroughs in biotechnology.
Article activity feed
-
While the study showed that there is good predictive accuracy by the ML between high and low-performance mutants, the prediction between high and medium-performance groups was less accurate, likely due to less distinct sequence differences between these groups.
Do you think this will be consistent across proteins or do you think that some of this is protein-specific? I'm interested to see this method applied to new proteins in the future!
-
Low (L, < 6 µM)
Does this also include cells where no apigenin was produced? I'm curious how cases like that, or cases where the variant is completely nonfunctional, factor into this analysis.
-
Mutations for H group clustered at AA89, AA160-AA174, AA189, AA203, and AA206. In the M group, most of the mutations were located at AA160-AA174, AA189, and AA203. In the L group, most mutations were observed at AA89, AA120, AA138, AA159-AA180, AA188, and AA205.
Interesting that there's quite a bit of overlap between all of these! Does the grouping of mutations to particular regions of the protein have to do with the strategy used to generate variants?
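To put a rough number on that overlap, here is a minimal sketch that treats the positions listed in the quoted passage as simple sets and intersects them. It assumes the quoted ranges are inclusive and that the listed positions stand in for each group's clusters; the helper name `positions` is made up for illustration, and this is not the authors' analysis.

```python
# Positions copied from the quoted passage; ranges are assumed inclusive.
def positions(*spans):
    """Expand single positions and (start, end) ranges into a set (hypothetical helper)."""
    out = set()
    for span in spans:
        if isinstance(span, tuple):
            start, end = span
            out.update(range(start, end + 1))
        else:
            out.add(span)
    return out

groups = {
    "H": positions(89, (160, 174), 189, 203, 206),
    "M": positions((160, 174), 189, 203),
    "L": positions(89, 120, 138, (159, 180), 188, 205),
}

# Positions listed for all three groups, and positions listed for only one group.
shared_all = groups["H"] & groups["M"] & groups["L"]
h_only = groups["H"] - groups["M"] - groups["L"]
l_only = groups["L"] - groups["H"] - groups["M"]

print("shared by H, M, L:", sorted(shared_all))
print("H only:", sorted(h_only))
print("L only:", sorted(l_only))
```

Under these assumptions, the AA160-AA174 stretch is the part common to all three groups, while only a handful of listed positions are specific to a single group. This compares listed positions only and says nothing about which residues were substituted or how often.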
-
single mutation event occurrences
Are all of your variants single amino acid substitutions? Do you have variants that have multiple substitutions?
-