A General Transformer-Based Multi-Task Learning Framework for Predicting Interaction Types between Enzyme and Small Molecule
Abstract
Predicting enzyme–small molecule interactions is critical for drug discovery and for understanding the biochemical processes of life in general. While recent deep learning approaches have shown promising results, several challenges remain: the lack of a comprehensive training dataset; architectures that lack communication between the representations of enzymes and small molecules; the tendency to simplify the problem to enzyme–substrate vs. enzyme–non-interacting pairs, thereby misclassifying enzyme–inhibitor pairs as substrates; and the neglect of the true impact of data leakage on model performance. To address these issues, we present EMMA (Enzyme–small Molecule interaction Multi-head Attention), a transformer-based multi-task learning framework designed to learn pairwise interaction signals between enzymes and small molecules and to generalize well to out-of-distribution data. EMMA operates directly on the SMILES string representations of small molecules and on enzyme sequences, with two classification heads that distinguish enzyme–non-interacting, enzyme–substrate, and enzyme–inhibitor pairs. By evaluating EMMA under five distinct data-splitting regimes that control for different types of data leakage, we demonstrate that it achieves strong and robust performance, particularly on previously unseen combinations of enzymes and small molecules. A deeper analysis further highlights that the topological properties of the enzyme–small molecule interaction network are crucial for model performance and generalization, underscoring once more the decisive role of well-designed training datasets in successful model training.
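To make the described setup concrete, the following is a minimal, hypothetical PyTorch sketch of a transformer-based multi-task model in the spirit of EMMA: it encodes SMILES tokens and enzyme sequence tokens, lets the two representations communicate through multi-head cross-attention, and passes a pooled pair representation to two classification heads (interacting vs. non-interacting, and substrate vs. inhibitor). All layer sizes, the tokenization, and the exact head design are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class EnzymeMoleculeInteractionModel(nn.Module):
    """Hypothetical sketch of a transformer-based multi-task interaction model."""

    def __init__(self, mol_vocab=128, enz_vocab=32, d_model=256, n_heads=8, n_layers=2):
        super().__init__()
        # Separate token embeddings for SMILES strings and amino-acid sequences
        self.mol_embed = nn.Embedding(mol_vocab, d_model)
        self.enz_embed = nn.Embedding(enz_vocab, d_model)
        # Self-attention encoders for each modality
        self.mol_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        self.enz_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)
        # Cross-attention lets enzyme and small-molecule representations communicate
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Two classification heads for the multi-task objective (assumed split):
        # head 1: interacting vs. non-interacting; head 2: substrate vs. inhibitor
        self.head_interaction = nn.Linear(2 * d_model, 2)
        self.head_type = nn.Linear(2 * d_model, 2)

    def forward(self, smiles_tokens, enzyme_tokens):
        mol = self.mol_encoder(self.mol_embed(smiles_tokens))
        enz = self.enz_encoder(self.enz_embed(enzyme_tokens))
        # Enzyme positions attend over small-molecule tokens (pairwise signal)
        attended, _ = self.cross_attn(query=enz, key=mol, value=mol)
        # Mean-pool each side and concatenate into one pair representation
        pair = torch.cat([attended.mean(dim=1), mol.mean(dim=1)], dim=-1)
        return self.head_interaction(pair), self.head_type(pair)


# Example usage with random token IDs (batch of 4 enzyme-molecule pairs)
model = EnzymeMoleculeInteractionModel()
smiles = torch.randint(0, 128, (4, 60))    # tokenized SMILES strings
enzymes = torch.randint(0, 32, (4, 300))   # tokenized enzyme sequences
logits_interact, logits_type = model(smiles, enzymes)
print(logits_interact.shape, logits_type.shape)  # torch.Size([4, 2]) torch.Size([4, 2])
```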