RemoteFoldSet: Benchmarking Structural Awareness of Protein Language Models
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Protein language models (pLMs) have the capacity to infer structural information from amino acid sequences. Evaluating the extent to which they truly encode structural information is crucial for assessing their generalizability and the interpretability of their latent representations, yet current approaches lack a model-free, quantitative framework for evaluating these encodings. We introduce RemoteFoldSet, a curated collection of protein sequence sets stratified by high structural similarity but minimal sequence identity. We also define the Structural Awareness (SA) score, a novel metric that enables model-agnostic, unsupervised, and training-free quantification of structure-related patterns in pLM embeddings. Using RemoteFoldSet together with the SA score, we benchmark a range of existing pLMs, elucidating how models with different training objectives, architectures, and sizes discriminate and distribute proteins within their embedding spaces, both quantitatively and qualitatively. We expect this methodology to serve as a reliable benchmark for evaluating the performance of pLMs in structural and functional applications.
Article activity feed
RemoteFoldSet: Benchmarking Structural Awareness of Protein Language Models
I have a major concern regarding the dataset construction. My primary question is, why did you choose to generate synthetic sequences (and structures) instead of using natural homologs? Databases like CATH or SCOP are full of naturally occurring protein pairs that share a fold but have very low sequence identity. Using those would have grounded your benchmark in real biological evolution rather than generative noise.
Regarding your use of the "twilight zone" concept: while your dataset technically hits the 26% identity mark, I feel this misrepresents what the term actually describes. The twilight zone concerns evolutionary homology, i.e., sequences that have diverged over millions of years under selection and drift while maintaining structure. Your sequences, by contrast, are hallucinations from an inverse folding model run at high temperature. Generative variance is not the same as evolutionary divergence, and a pLM recognizing ProteinMPNN's output patterns is not the same as understanding structural conservation.
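For readers unfamiliar with how the 26% figure is typically computed, percent identity is simply the fraction of matching residues over the aligned (non-gap) columns of a pairwise alignment. A minimal sketch (with toy, hypothetical aligned fragments; real benchmarks would use a proper aligner such as Needleman-Wunsch over full sequences):

```python
def percent_identity(aln_a: str, aln_b: str) -> float:
    """Percent identity over aligned columns; gap characters '-' are excluded."""
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap
    aligned = [(a, b) for a, b in zip(aln_a, aln_b) if a != "-" and b != "-"]
    matches = sum(a == b for a, b in aligned)
    return 100.0 * matches / len(aligned)

# Toy aligned fragments (illustrative only, not from the paper's dataset)
a = "MKT-AYIAKQR"
b = "MKSLAYLGKQR"
print(round(percent_identity(a, b), 1))  # → 70.0
```

Note that the identity threshold is sensitive to alignment choices (local vs. global, gap penalties), which is one more reason synthetic sequences hovering near a nominal 26% are not directly comparable to evolutionarily diverged pairs.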
Furthermore, relying entirely on synthetic validation creates a circular loop. You are testing whether a pLM can recognize sequences generated by ProteinMPNN and "validated" by AlphaFold3, without any experimental ground truth that these sequences actually fold. And to be frank, it is straightforward to generate high-pTM AF3 structures that do not fold: introduce a tryptophan mutation into your favorite protein, and its pTM will be almost unaffected, but good luck expressing and purifying it. It is likely that a substantial proportion of your dataset does not fold in reality.