Diagnosing protein sequence search in the era of language models

Han Zhou
Yifan Yang
Yang Young Lu

Abstract

Protein language model (PLM)-based search is rapidly emerging as a successor to classical sequence alignment, with recent high-profile studies reporting substantial improvements in speed and remote homology detection. However, success on standard benchmarks does not guarantee that similarity derived from PLM embeddings constitutes reliable biological evidence. Here, we introduce PLM-GUARD, a diagnostic framework designed to interrogate the underlying meaning of protein search scores and assess their biological trustworthiness. PLM-GUARD comprises six sanity checks spanning biological fidelity, semantic validity, and manipulation safety. Across eight representative search methods, classical alignment-based systems demonstrate remarkable robustness, whereas current PLM-based methods fail broadly across all three dimensions. Notably, hybrid methods show intermediate results, indicating that alignment is still critical for ensuring biologically grounded correspondence. Our findings provide a timely clarification for the field and underscore the necessity of diagnostic evaluation as protein search enters the era of language models.

Version published to 10.64898/2026.04.26.720921 on bioRxiv
Apr 29, 2026

