Automatically Defining Protein Words for Diverse Functional Predictions Based on Attention Analysis of a Protein Language Model

Abstract

Understanding the relationship between protein sequence and function remains a longstanding challenge in bioinformatics, and to date the lion’s share of related tools parse proteins at the domain or motif levels. Here, we define “protein words” as an alternative to “motif” for studying proteins and functional prediction applications. We first developed an unsupervised tool we term Protein Wordwise, which parses analyte protein sequences into protein words by analyzing attention matrices from a protein language model (PLM) through a community detection algorithm. We then developed a supervised sequence-function prediction model called Word2Function, for mapping protein words to GO terms through feature importance analysis. We compared the prediction performance of our protein word-based toolkit with a motif-based method (PROSITE) on multiple protein function datasets. We also assembled a functionally diverse data resource we term PWNet to support evaluation of protein words for predicting functional residues across 10 tasks (e.g., diverse biomolecular binding, catalysis, and ion-channel activity). Our toolkit outperforms PROSITE in all the examined datasets and tasks. By abandoning domains and instead using attention matrices from a PLM for automatic, systematic, and annotation-agnostic parsing of proteins, our toolkit both outperforms currently available tools for functional annotations at the residue and whole-protein levels and suggests innovative forms of protein analysis well-suited to the post-AlphaFold era of biochemistry.
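The core parsing step described above — treating a PLM attention matrix as a residue-residue graph and extracting "words" via community detection — can be illustrated with a minimal sketch. Note this is an assumption-laden toy, not the authors' Protein Wordwise implementation: the function name `parse_protein_words`, the symmetrization, the edge threshold, and the choice of NetworkX's greedy modularity algorithm are all illustrative stand-ins, and the attention matrix here is synthetic rather than taken from a real PLM.

```python
import numpy as np
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def parse_protein_words(attention, threshold=0.1):
    """Partition residue positions into 'protein words' by community
    detection on a symmetrized attention matrix (illustrative sketch).

    attention : (n, n) array of attention weights between residues
    threshold : hypothetical cutoff below which edges are dropped
    """
    n = attention.shape[0]
    sym = (attention + attention.T) / 2  # attention is not symmetric in general

    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            w = float(sym[i, j])
            if w > threshold:
                g.add_edge(i, j, weight=w)

    # Each detected community is one candidate "protein word"
    communities = greedy_modularity_communities(g, weight="weight")
    return [sorted(c) for c in communities]


# Synthetic attention with two blocks of strongly inter-attending residues,
# standing in for attention extracted from a real PLM layer/head.
rng = np.random.default_rng(0)
att = rng.uniform(0.0, 0.05, size=(8, 8))
att[:4, :4] += 0.5
att[4:, 4:] += 0.5

words = parse_protein_words(att)
```

In a real pipeline the attention would come from a PLM such as ESM, likely aggregated over layers and heads before community detection; the toy blocks above simply make the two expected "words" recoverable.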
