RemoteFoldSet: Benchmarking Structural Awareness of Protein Language Models

Zinnia Ma
Neville P. Bethel

Abstract

Protein language models (pLMs) can infer structural information from amino acid sequences. Evaluating the extent to which they truly encode structural information is crucial for assessing their generalizability and the interpretability of their latent representations, yet current approaches lack a model-free, quantitative framework for evaluating these encodings. We introduce RemoteFoldSet, a curated collection of protein sequence sets stratified by high structural similarity but minimal sequence identity. We also define the Structural Awareness (SA) score, a novel metric that enables model-agnostic, unsupervised, and training-free quantification of structure-related patterns in pLM embeddings. Using RemoteFoldSet together with the SA score, we benchmark a range of existing pLMs, elucidating both quantitatively and qualitatively how models with different training objectives, architectures, and sizes discriminate and distribute proteins within their embedding spaces. We expect this methodology to serve as a reliable benchmark for evaluating the performance of pLMs in structural and functional applications.
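
The abstract does not give the formula for the SA score, so the sketch below is not the authors' metric. It only illustrates the general idea of a training-free, label-based check on embeddings: given one embedding vector per sequence and a set label per sequence (sequences in the same set sharing a fold), compare the mean cosine similarity within sets to the mean similarity between sets. The function names, the gap statistic, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an (n, d) array."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return unit @ unit.T


def within_between_gap(embeddings, set_labels):
    """Mean within-set similarity minus mean between-set similarity.

    Hypothetical stand-in statistic, NOT the SA score from the paper.
    A larger gap means sequences from the same structural set sit
    closer together in embedding space than sequences from different sets.
    """
    embeddings = np.asarray(embeddings, dtype=float)
    labels = np.asarray(set_labels)
    sim = cosine_similarity_matrix(embeddings)

    same_set = labels[:, None] == labels[None, :]
    off_diagonal = ~np.eye(len(labels), dtype=bool)  # drop self-similarity

    within = sim[same_set & off_diagonal].mean()
    between = sim[~same_set].mean()
    return within - between


# Toy data (assumed, not from the paper): 3 sets of 4 "embeddings",
# each set scattered tightly around its own random center.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 16))
toy_embeddings = np.vstack(
    [center + 0.1 * rng.normal(size=(4, 16)) for center in centers]
)
toy_labels = np.repeat([0, 1, 2], 4)

gap = within_between_gap(toy_embeddings, toy_labels)
print(f"within-minus-between similarity gap: {gap:.3f}")
```

In practice the toy arrays would be replaced by per-sequence embeddings from a pLM (for example, mean-pooled residue embeddings), with labels taken from the sequence sets. Because the statistic needs no training and no model internals, it can be applied to any model that yields fixed-length embeddings.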

Version published to 10.1101/2025.09.23.678152 on bioRxiv
Sep 23, 2025

Related articles

Protein Language Models Capture Structural and Functional Epistasis in a Zero-Shot Setting

This article has 5 authors:
1. Ananthan Nambiar
2. Sayantani B. Littlefield
3. Carlos Cuellar
4. Rohit Khorana
5. Sergei Maslov
This article has no evaluations. Latest version: Sep 17, 2025

NucleicBERT: Deciphering the language of nucleic acids by a large-language model

This article has 4 authors:
1. Utkarsh Upadhyay
2. Julian Herold
3. Markus Götz
4. Alexander Schug
This article has no evaluations. Latest version: Sep 6, 2025

FusionProt: Fusing Sequence and Structural Information for Unified Protein Representation Learning

This article has 3 authors:
1. Dan Kalifa
2. Uriel Singer
3. Kira Radinsky
This article has no evaluations. Latest version: Aug 8, 2025
