Squidly: Enzyme Catalytic Residue Prediction Harnessing a Biology-Informed Contrastive Learning Framework
Curation statements for this article:
- Curated by eLife
eLife Assessment
The authors make an important advance in enzyme annotation by fusing biochemical knowledge with language-model-based learning to predict catalytic residues from sequence alone. Squidly, a new ML method, outperforms existing tools on standard benchmarks and on the CataloDB dataset. The work has solid support, yet clarifications on dataset biases, ablation analyses, and uncertainty filtering would strengthen its efficiency claims.
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (eLife)
Abstract
Enzymes offer a sustainable alternative to conventional chemistry in industrial processes, drug synthesis, and bioremediation. Because catalytic residues are the key amino acids that drive enzyme function, their accurate identification facilitates enzyme function prediction. Sequence similarity-based approaches such as BLAST are fast but require previously annotated homologs. Machine learning (ML) approaches aim to overcome this limitation; however, current gold-standard ML-based methods require high-quality 3D structures, limiting their application to large datasets. To address these challenges, we developed Squidly, a sequence-only tool that leverages contrastive representation learning with a biology-informed, rationally designed pairing scheme to distinguish catalytic from non-catalytic residues using per-token protein language model embeddings. Squidly surpasses state-of-the-art ML annotation methods in catalytic residue prediction while remaining fast enough to enable wide-scale screening of databases. We ensemble Squidly with BLAST to provide an efficient tool that annotates catalytic residues with high precision and recall for both in- and out-of-distribution sequences.
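As a concrete illustration of the sequence-only idea described in the abstract, the following is a minimal sketch (assuming the open-source `fair-esm` package; the small classification head is a hypothetical stand-in for illustration, not Squidly's trained model):

```python
# Minimal sketch: per-residue catalytic scoring from per-token ESM2 embeddings.
# Assumes the `fair-esm` package; the classification head below is a
# hypothetical placeholder, not Squidly's trained model.
import torch
import esm

# Load a pretrained ESM2 model (650M-parameter variant as an example).
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

data = [("enzyme_1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(data)

with torch.no_grad():
    out = model(tokens, repr_layers=[33])
# Drop BOS/EOS tokens: one 1280-d embedding per residue.
per_residue = out["representations"][33][0, 1:-1]

# Hypothetical per-residue head mapping each embedding to P(catalytic).
head = torch.nn.Sequential(
    torch.nn.Linear(1280, 128), torch.nn.ReLU(), torch.nn.Linear(128, 1)
)
probs = torch.sigmoid(head(per_residue)).squeeze(-1)
catalytic_idx = (probs > 0.5).nonzero().flatten()  # predicted residue indices
```

In an ensemble such as the one described, a BLAST hit against an annotated homolog would presumably take precedence, with the embedding-based score covering sequences that lack close homologs.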
Article activity feed
-
Reviewer #1 (Public review):
In this well-written and timely manuscript, Rieger et al. introduce Squidly, a new deep learning framework for catalytic residue prediction. The novelty of the work lies in integrating per-residue embeddings from large protein language models (ESM2) with a biology-informed contrastive learning scheme that leverages enzyme class information to rationally mine hard positive/negative pairs. Importantly, the method avoids reliance on predicted 3D structures, enabling scalability, speed, and broad applicability. The authors show that Squidly outperforms existing ML-based tools and even BLAST in certain settings, while an ensemble with BLAST achieves state-of-the-art performance across multiple benchmarks. Additionally, the introduction of the CataloDB benchmark, designed to test generalization at low sequence and structural identity, represents another important contribution of this work.
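To make the pairing scheme concrete, the sketch below shows one schematic way EC class labels could drive hard positive/negative mining for a contrastive (triplet) objective; the record fields and pairing rules are illustrative assumptions, not the authors' exact scheme.

```python
# Schematic EC-informed pair mining for contrastive learning (illustrative
# assumptions throughout; not the authors' exact pairing scheme).
import random
import torch
import torch.nn.functional as F

def mine_triplets(records):
    """records: dicts with 'emb' (Tensor [L, D] of per-residue embeddings),
    'catalytic' (set of residue indices), and 'ec' (top-level EC class)."""
    anchors, positives, negatives = [], [], []
    for rec in records:
        same_ec = [r for r in records if r is not rec and r["ec"] == rec["ec"]]
        non_cat = [k for k in range(rec["emb"].shape[0])
                   if k not in rec["catalytic"]]
        if not same_ec or not non_cat:
            continue
        for i in rec["catalytic"]:
            other = random.choice(same_ec)
            # Positive: a catalytic residue from another enzyme of the same EC class.
            j = random.choice(sorted(other["catalytic"]))
            anchors.append(rec["emb"][i])
            positives.append(other["emb"][j])
            # Hard negative: a non-catalytic residue from the anchor's own sequence.
            negatives.append(rec["emb"][random.choice(non_cat)])
    return torch.stack(anchors), torch.stack(positives), torch.stack(negatives)

def triplet_loss(a, p, n, margin=1.0):
    # Pull same-class catalytic residues together; push non-catalytic ones away.
    return F.relu(F.pairwise_distance(a, p)
                  - F.pairwise_distance(a, n) + margin).mean()
```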
I have only some minor comments:
(1) The manuscript acknowledges biases in EC class representation, particularly the enrichment for hydrolases. While CataloDB addresses some of these issues, the strong imbalance across enzyme classes may still limit conclusions about generalization. Could the authors provide per-class performance metrics, especially for underrepresented EC classes?
(2) An ablation analysis would be valuable to demonstrate how specific design choices in the algorithm contribute to capturing catalytic residue patterns in enzymes.
(3) The statement that users can optionally use uncertainty to filter predictions is promising but underdeveloped. How should predictive entropy values be interpreted in practice? Is there an empirical threshold that separates high- from low-confidence predictions? A demonstration of how uncertainty filtering shifts the trade-off between false positives and false negatives would clarify the practical utility of this feature (a toy illustration of such filtering appears after this list).
(4) The manuscript highlights computational efficiency, reporting substantial runtime improvements (e.g., 108 s vs. 5757 s). However, the comparison lacks details on dataset size, hardware/software environment, and reproducibility conditions. Without these details, the speedup claim is difficult to evaluate. Furthermore, it remains unclear whether the reported efficiency gains come at the expense of predictive performance.
(5) Given the well-known biases in public enzyme databases, the dataset is likely enriched for model organisms (e.g., E. coli, yeast, human enzymes) and underrepresents enzymes from archaea, extremophiles, and diverse microbial taxa. Would this limit conclusions about Squidly's generalisability to less-studied lineages?
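Regarding comment (3), a toy sketch of entropy-based filtering follows; the probabilities and the 0.2-nat cutoff are arbitrary examples, not values recommended by the authors.

```python
# Toy sketch of filtering per-residue predictions by predictive entropy.
# The probabilities and 0.2-nat cutoff are arbitrary illustrative values.
import numpy as np

def binary_entropy(p, eps=1e-12):
    """Predictive entropy (nats) of a Bernoulli probability p; 0 = certain,
    ln(2) ~= 0.693 = maximally uncertain."""
    p = np.clip(p, eps, 1 - eps)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p))

probs = np.array([0.97, 0.55, 0.88, 0.51])  # P(catalytic) per residue
entropy = binary_entropy(probs)

threshold = 0.2  # lowering it trades recall for precision
keep = (entropy < threshold) & (probs > 0.5)
print(np.nonzero(keep)[0])  # residue indices retained as confident catalytic calls
```

Sweeping such a threshold and plotting precision against recall would directly show the trade-off the reviewer asks about.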
-
Reviewer #2 (Public review):
Summary:
The authors aim to develop Squidly, a sequence-only catalytic residue prediction method. By combining protein language model (ESM2) embeddings with a biologically inspired contrastive learning pairing strategy, they achieve efficient and scalable predictions without relying on three-dimensional structure. Overall, the authors largely achieved their stated objectives, and the results generally support their conclusions. This research has the potential to advance the fields of enzyme functional annotation and protein design, particularly in the context of screening large-scale sequence databases and sequences lacking structural data. However, the data and methods are still limited by the biases of current public databases, so the interpretation of predictions requires specific biological context and experimental validation.
Strengths:
The strengths of this work include the innovative methodological incorporation of EC classification information for "reaction-informed" sample pairing, thereby enhancing the discriminative power of contrastive learning. Results show that Squidly outperforms existing machine learning methods on multiple benchmarks and is significantly faster than structure prediction tools, demonstrating its practicality.
Weaknesses:
Weaknesses include the lack of a systematic evaluation of the impact of each strategy on model performance. Furthermore, some analyses, such as the PCA visualization, exhibit low explained variance, which weakens the conclusions drawn from them.
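To unpack the explained-variance point: a check like the following (random vectors standing in for real residue embeddings; scikit-learn assumed) reports how much variance the first two principal components of a high-dimensional embedding space capture, which bounds how faithful any 2-D PCA plot can be.

```python
# Quick check of how much variance a 2-D PCA captures; random 1280-d vectors
# stand in for real per-residue embeddings (scikit-learn assumed).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
embeddings = rng.standard_normal((5000, 1280))

pca = PCA(n_components=2).fit(embeddings)
print(pca.explained_variance_ratio_.sum())  # near 2/1280 for isotropic data
```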