Integrating Millions of Years of Evolutionary Information into Protein Structure Models for Function Prediction

Abstract

Background

Understanding life processes relies on accurate protein function prediction, which fundamentally requires integrating the evolutionary information encoded in sequences with the spatial characteristics of 3D structures. Existing approaches, however, are often limited: they over-rely on sequence, use simplified structural representations in place of fine-grained spatial detail, or fail to capture the synergistic relationship between sequence and structure. These problems are compounded by the difficulty of acquiring annotated data.

Results

To address these issues, we propose a novel contrast-aware pre-training framework, ESMSCOP. ESMSCOP leverages a state-of-the-art protein language model to harness evolutionary insights embedded in sequences, and introduces a new encoder to fuse topological and fine-grained spatial structural features. By employing a contrastive pre-training strategy with auxiliary supervision, ESMSCOP effectively bridges the sequence-structure gap, yielding rich and informative representations.
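The contrastive pre-training strategy described above can be illustrated with a symmetric InfoNCE-style objective that pulls matched sequence and structure embeddings of the same protein together while pushing mismatched pairs apart. The sketch below is an illustrative assumption, not the paper's actual loss: the function name, the cosine-similarity choice, and the temperature value are all hypothetical, and a real implementation would operate on tensors from the language-model and structure encoders rather than Python lists.

```python
import math

def contrastive_alignment_loss(seq_embs, struct_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss aligning per-protein sequence and
    structure embeddings (illustrative sketch, not ESMSCOP's exact loss).
    seq_embs[i] and struct_embs[i] come from the same protein and form
    the positive pair; all other pairings serve as negatives."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    def cosine(a, b):
        return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

    n = len(seq_embs)
    # temperature-scaled similarity matrix: rows = sequences, cols = structures
    sim = [[cosine(s, t) / temperature for t in struct_embs] for s in seq_embs]

    def row_loss(row, pos):
        m = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in row]
        return -math.log(exps[pos] / sum(exps))

    # sequence -> structure direction
    loss_s2t = sum(row_loss(sim[i], i) for i in range(n)) / n
    # structure -> sequence direction (transposed similarity matrix)
    loss_t2s = sum(row_loss([sim[j][i] for j in range(n)], i) for i in range(n)) / n
    return 0.5 * (loss_s2t + loss_t2s)
```

Under this objective, correctly paired embeddings yield a near-zero loss while shuffled pairings are heavily penalized, which is the mechanism by which such pre-training bridges the sequence-structure gap without requiring function labels; in the framework described above, auxiliary supervision would be added on top of a term like this.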

Conclusions

Extensive experiments on multiple benchmark datasets demonstrate that ESMSCOP outperforms existing methods on protein function prediction tasks. Moreover, it performs strongly while using less pre-training data than some large-scale models.
