Paying Attention to Attention: High Attention Sites as Indicators of Protein Family and Function in Language Models


Abstract

Protein Language Models (PLMs) use transformer architectures to capture patterns within protein sequences, providing a powerful computational representation of the protein sequence [1]. Through large-scale training on protein sequence data, PLMs generate vector representations that encapsulate the biochemical and structural properties of proteins [2]. At the core of PLMs is the attention mechanism, which facilitates the capture of long-range dependencies by computing pairwise importance scores across residues, thereby highlighting regions of biological interaction within the sequence [3]. The attention matrices offer an untapped opportunity to uncover specific biological properties of proteins, particularly their functions. In this work, we introduce a novel approach, using the Evolutionary Scale Model (ESM) [4], for identifying High Attention (HA) sites within protein sequences, corresponding to key residues that define protein families. By examining attention patterns across multiple layers, we pinpoint residues that contribute most to family classification and function prediction. Our contributions are as follows: (1) we propose a method for identifying HA sites at critical residues from the middle layers of the PLM; (2) we demonstrate that these HA sites provide interpretable links to biological functions; and (3) we show that HA sites improve active site predictions for functions of unannotated proteins. We make available the HA sites for the human proteome. This work offers a broadly applicable approach to protein classification and functional annotation and provides a biological interpretation of the PLM’s representation.
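The middle-layer attention aggregation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the aggregation rule shown here (mean over a few middle layers and all heads, then summing the attention each residue receives) is an assumption for illustration, and the toy attention tensor stands in for real ESM attention maps.

```python
import numpy as np

def high_attention_sites(attn, mid_layers, top_k=5):
    """Identify candidate High Attention (HA) sites from attention maps.

    attn: array of shape (n_layers, n_heads, seq_len, seq_len), where
          rows index query residues and columns index key residues.
    mid_layers: indices of the middle layers to aggregate (assumed rule).
    Returns the indices of the top_k residues receiving the most attention.
    """
    # Average attention over the chosen middle layers and all heads.
    avg = attn[mid_layers].mean(axis=(0, 1))      # (seq_len, seq_len)
    # Attention *received* by each residue = sum over query positions.
    received = avg.sum(axis=0)                    # (seq_len,)
    # The highest-scoring residues are the candidate HA sites.
    return np.argsort(received)[::-1][:top_k]

# Toy example: 6 layers, 4 heads, 50-residue sequence, with extra
# attention injected toward residues 10 and 30 before row-normalising.
rng = np.random.default_rng(0)
attn = rng.random((6, 4, 50, 50))
attn[:, :, :, 10] += 5.0
attn[:, :, :, 30] += 3.0
attn /= attn.sum(axis=-1, keepdims=True)          # rows sum to 1
ha = high_attention_sites(attn, mid_layers=[2, 3], top_k=2)
print(sorted(ha.tolist()))  # → [10, 30]
```

With a real PLM, `attn` would be the per-layer, per-head attention matrices returned by the model for a given sequence; the thresholding and layer-selection choices in the paper may differ from this sketch.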


Author Summary

Understanding how proteins work is critical to advances in biology and medicine, and protein language models (PLMs) make it possible to study protein sequences at scale. These models identify patterns within protein sequences by focusing on key regions that are important for distinguishing the protein. Our work focuses on the Evolutionary Scale Model (ESM), a state-of-the-art PLM, and analyzes the model's internal attention mechanism to identify these significant residues.

We developed a new method to identify “High Attention (HA)” sites—specific parts of a protein sequence that are essential for classifying proteins into families and predicting their functions. By analyzing how the model prioritizes certain regions of protein sequences, we discovered that these HA sites often correspond to residues critical for biological activity, such as active sites where chemical reactions occur. Our approach helps interpret how PLMs understand protein data and enhances predictions for proteins whose functions are still unknown. As part of this work, we provide HA-site information for the entire human proteome, offering researchers a resource to further study the potential functional relevance of these residues.
