Boosting the Predictive Power of Protein Representations with a Corpus of Text Annotations


Abstract

Protein language models are trained to predict amino acid sequences from vast protein databases and, in the process, learn to represent proteins as feature vectors. These vector representations have enabled impressive applications, from predicting mutation effects to protein folding. One of the reasons offered for the success of these models is that conserved sequence motifs tend to be important for protein fitness. Yet the relationship between sequence conservation and fitness can be confounded by evolutionary and environmental context. Should we therefore look to other data sources that may contain more direct functional information? In this work, we conduct a comprehensive study examining the effects of training protein models to predict nineteen types of text annotations from UniProt. Our results show that fine-tuning protein models on a subset of these annotations enhances the models' predictive capabilities on a variety of function prediction tasks. Notably, our model outperforms the search algorithm BLAST, a feat that none of the pre-trained protein models achieved in our evaluation. These findings suggest that a much wider array of data modalities, such as text annotations, may be tapped to improve protein language models. We host our model checkpoints at https://huggingface.co/h4duan.
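
As a rough illustration of the kind of fine-tuning the abstract describes, the sketch below attaches an annotation-prediction head to a pre-trained protein encoder and runs one training step. This is not the authors' implementation: the encoder checkpoint (facebook/esm2_t6_8M_UR50D), the single-label classification head, and the mapping of the nineteen annotation types to class indices are all assumptions made for the example.

```python
# Minimal sketch (assumed setup, not the paper's method): fine-tune a
# protein language model to predict an annotation category for a sequence.
import torch
from torch import nn
from transformers import AutoTokenizer, EsmModel

CHECKPOINT = "facebook/esm2_t6_8M_UR50D"  # illustrative encoder choice
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
encoder = EsmModel.from_pretrained(CHECKPOINT)

NUM_ANNOTATION_TYPES = 19  # hypothetical: one class per annotation type

class AnnotationClassifier(nn.Module):
    """Protein encoder with a linear annotation-prediction head."""

    def __init__(self, encoder, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(encoder.config.hidden_size, num_classes)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Mean-pool per-residue embeddings into one protein-level vector.
        mask = attention_mask.unsqueeze(-1)
        pooled = (out.last_hidden_state * mask).sum(1) / mask.sum(1)
        return self.head(pooled)

model = AnnotationClassifier(encoder, NUM_ANNOTATION_TYPES)

# One fine-tuning step on a toy sequence with a dummy annotation label.
batch = tokenizer(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(logits, torch.tensor([0]))
loss.backward()  # an optimizer step would follow in a real training loop
```

In practice the paper predicts free-text annotations rather than fixed labels, so a sequence-to-sequence decoding head would be closer to the described setup; the classification head above is simply the smallest self-contained variant of the idea.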
