RemoteFoldSet: Benchmarking Structural Awareness of Protein Language Models

Abstract

Protein language models (pLMs) can infer structural information from amino acid sequences. Evaluating how much structural information they truly encode is crucial for assessing their generalizability and the interpretability of their latent representations, yet current approaches lack a model-free, quantitative framework for evaluating these encodings. We introduce RemoteFoldSet, a curated collection of protein sequence sets stratified by high structural similarity but minimal sequence identity. We also define the Structural Awareness (SA) score, a novel metric that enables model-agnostic, unsupervised, and training-free quantification of structure-related patterns in pLM embeddings. Using RemoteFoldSet together with the SA score, we benchmark a range of existing pLMs, showing both quantitatively and qualitatively how models with different training objectives, architectures, and sizes discriminate and distribute proteins within their embedding spaces. We expect this methodology to serve as a reliable benchmark for evaluating pLMs in structural and functional applications.
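The abstract does not give the formula for the SA score, so the sketch below is not that metric. It only illustrates the general idea of a training-free, model-agnostic measure computed directly on embeddings: proteins from the same structural set should lie closer together than proteins from different sets. The function name `within_between_contrast`, the choice of cosine similarity, and the toy data are all assumptions made for illustration.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def within_between_contrast(embeddings, set_ids):
    """Mean within-set cosine similarity minus mean between-set similarity.

    Hypothetical stand-in for illustration only; NOT the paper's SA score.
    `embeddings` is a list of vectors (one per protein) and `set_ids` gives
    the structural set each protein belongs to.
    """
    within, between = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        if set_ids[i] == set_ids[j]:
            within.append(sim)
        else:
            between.append(sim)
    if not within or not between:
        raise ValueError("need at least one within-set and one between-set pair")
    return sum(within) / len(within) - sum(between) / len(between)


# Toy data (assumed): two structural sets, two proteins each, 2-D "embeddings".
toy_embeddings = [[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]]
toy_set_ids = ["fold_a", "fold_a", "fold_b", "fold_b"]
score = within_between_contrast(toy_embeddings, toy_set_ids)
print(score)
```

In practice the vectors would be per-protein pLM embeddings (for example, mean-pooled residue representations) rather than toy values. A higher contrast would suggest that the embedding space groups proteins by structure even when their sequences share little identity.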
