Paying attention to attention: High attention sites as indicators of protein family and function in language models

Abstract

Protein Language Models (PLMs) use transformer architectures to capture patterns within protein primary sequences, providing a powerful computational representation of the amino acid sequence. Through large-scale training on protein primary sequences, PLMs generate vector representations that encapsulate the biochemical and structural properties of proteins. At the core of PLMs is the attention mechanism, which captures long-range dependencies by computing pairwise importance scores across residues, thereby highlighting regions of biological interaction within the sequence. These attention matrices offer an untapped opportunity to uncover specific biological properties of proteins, particularly their functions. In this work, we introduce a novel approach, built on Evolutionary Scale Modeling (ESM), for identifying High Attention (HA) sites within protein primary sequences, corresponding to key residues that define protein families. By examining attention patterns across multiple layers, we pinpoint the residues that contribute most to family classification and function prediction. Our contributions are as follows: (1) we propose a method for identifying HA sites at critical residues from the middle layers of the PLM; (2) we demonstrate that these HA sites provide interpretable links to biological functions; and (3) we show that HA sites improve active-site prediction for unannotated proteins. We make the HA sites for the human proteome available. This work offers a broadly applicable approach to protein classification and functional annotation and provides a biological interpretation of the PLM's representation.
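
As an illustration of the kind of attention extraction the abstract describes, the sketch below pulls per-head attention maps from a pretrained ESM-2 model (via the open-source fair-esm package) and flags residues that receive unusually high attention at a middle layer. The aggregation choices here (mean over heads, summed attention received per residue, a z-score cutoff of 2) are illustrative assumptions for the sketch, not the authors' exact procedure.

```python
# Minimal sketch: extract attention from ESM-2 and flag candidate
# "high attention" residues at a middle layer. The layer choice,
# head averaging, and z-score threshold are illustrative assumptions.
import torch
import esm

# Load a pretrained ESM-2 model (33 layers) and its batch converter.
model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQ"
_, _, tokens = batch_converter([("query", sequence)])

with torch.no_grad():
    out = model(tokens, need_head_weights=True)

# out["attentions"]: (batch, layers, heads, tokens, tokens), incl. BOS/EOS.
attn = out["attentions"][0]                # (layers, heads, T, T)
middle = attn.shape[0] // 2                # a "middle" layer, per the abstract
per_residue = attn[middle].mean(0).sum(0)  # total attention each token receives
per_residue = per_residue[1:-1]            # drop BOS/EOS special tokens

# Flag residues whose received attention is unusually high
# (z > 2 is an arbitrary cutoff chosen for this sketch).
z = (per_residue - per_residue.mean()) / per_residue.std()
ha_sites = [i for i, s in enumerate(z.tolist()) if s > 2.0]
print("Candidate high-attention residue indices:", ha_sites)
```

Summing each column of the head-averaged attention matrix gives the total attention a residue receives from all other positions, which is one simple proxy for the residue-level importance signal the abstract refers to.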
