Dual-encoder contrastive learning accelerates enzyme discovery
Abstract
The ability to engineer enzymes for desired reactions is a cornerstone of modern biotechnology, yet identifying suitable starting proteins remains a critical bottleneck. Dual-encoder contrastive learning models have emerged as a promising approach for enzyme discovery, learning to match chemical reactions with catalyzing enzymes through a shared embedding space. However, their practical performance beyond computational benchmarks remains unproven. Here, we close this gap with Horizyn-1, a deep learning framework for direct reaction-to-enzyme recommendation validated through comprehensive experimental testing. Leveraging a computationally efficient combination of reaction fingerprints and protein language models, we trained Horizyn-1 on 8.9 million reaction-enzyme pairs to achieve state-of-the-art performance, recovering an enzyme with correct activity within the top 100 hits for over 75% of test reactions. We experimentally validate Horizyn-1 across three enzyme discovery scenarios: identifying enzymes for orphan reactions, predicting enzyme promiscuity for both characterized and uncharacterized enzymes, and discovering enzymes for non-natural biochemical reactions including lysine-driven transaminations that enable efficient synthesis of non-canonical amino acids. On underrepresented reaction classes, we find that fine-tuning with fewer than 10 additional reactions can dramatically improve performance. Furthermore, a logarithmic scaling of model performance with training dataset size suggests continued improvement with larger and more diverse reaction datasets. Horizyn-1 addresses the critical bottleneck of sourcing initial enzymes for optimization campaigns, enabling efficient and scalable in silico screening for enzymes with desired activities and promising to accelerate future efforts in biocatalysis and metabolic engineering.
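As a rough illustration of the shared-embedding idea described in the abstract, the following NumPy sketch shows a symmetric contrastive (InfoNCE-style) loss over paired reaction and enzyme embeddings, together with a top-k hit rate of the kind reported above. It is not the Horizyn-1 implementation: the function names, the temperature value, and the simplification that each reaction has exactly one correct enzyme index are assumptions made for illustration, and the embeddings are assumed to come from encoders that are not shown.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def info_nce_loss(rxn_emb, enz_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each matrix is assumed to be a matched pair.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    r = l2_normalize(rxn_emb)
    e = l2_normalize(enz_emb)
    logits = r @ e.T / temperature  # (n_pairs, n_pairs) similarity matrix

    def cross_entropy_on_diagonal(lg):
        # Numerically stable log-softmax over each row, then score the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the reaction-to-enzyme and enzyme-to-reaction directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


def top_k_hit_rate(rxn_emb, enz_emb, true_idx, k=100):
    """Fraction of query reactions whose correct enzyme ranks within the top k.

    Assumes one correct enzyme index per reaction (a simplification).
    """
    scores = l2_normalize(rxn_emb) @ l2_normalize(enz_emb).T
    top_k = np.argsort(-scores, axis=1)[:, :k]  # highest-scoring enzymes first
    hits = (top_k == np.asarray(true_idx)[:, None]).any(axis=1)
    return float(hits.mean())
```

In this sketch, screening a new reaction amounts to embedding it once and ranking a precomputed matrix of enzyme embeddings by cosine similarity, which is what makes this style of in silico screening inexpensive at scale.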