Zero-shot segmentation using embeddings from a protein language model identifies functional regions in the human proteome
Abstract
The biological function of a protein is often determined by its distinct functional units, such as folded domains and intrinsically disordered regions. Identifying and categorizing these protein segments from sequence has been a major focus in computational biology, and this work has enabled the automatic annotation of folded protein domains. Here we show that embeddings from the unsupervised protein language model ProtT5 can be used to identify and categorize protein segments without relying on conserved patterns in the primary amino acid sequence. We present Zero-shot Protein Segmentation (ZPS), in which we use embeddings from ProtT5 to predict the boundaries of protein segments without training or fine-tuning any parameters. We find that ZPS boundary predictions for the human proteome are more consistent with reviewed annotations from UniProt than those of established bioinformatics tools, and that ProtT5 embeddings of ZPS segments can categorize folded domains, sub-domains, and intrinsically disordered regions. To explore ZPS predictions, we introduce a new way to visualize protein embeddings that closely resembles diagrams of distinct functional units in protein biology. Since ZPS and segment embeddings can be used without training or fine-tuning, the approach is not biased towards known annotations and can be used to identify and categorize unannotated protein segments. We used the segment embeddings to identify unannotated mitochondrion targeting signals and SYGQ-rich prion-like domains, which are functional regions within intrinsically disordered regions. We expect the protein segment organization revealed here to yield valuable information about protein function, including the function of intrinsically disordered regions and other less understood protein regions.
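The abstract does not specify how ZPS derives boundaries from the embeddings, so the following is only a hypothetical illustration of the general idea of training-free segmentation, not the authors' algorithm. It assumes per-residue embedding vectors are already available (toy 2-dimensional vectors stand in for ProtT5 embeddings here) and places a boundary wherever the cosine similarity between the mean embeddings of adjacent windows reaches a local minimum below a cutoff. The function names, the `window` size, and the `threshold` value are all assumptions made for this sketch.

```python
import math


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def segment_boundaries(embeddings, window=2, threshold=0.5):
    """Hypothetical boundary detector (an assumption, not the ZPS method).

    For each position i, compare the mean embedding of the `window`
    residues before i with the mean of the `window` residues from i on.
    Positions where that similarity is a local minimum below `threshold`
    are reported as segment boundaries.
    """
    n = len(embeddings)
    positions = list(range(window, n - window + 1))
    scores = [
        cosine(
            mean_vector(embeddings[i - window:i]),
            mean_vector(embeddings[i:i + window]),
        )
        for i in positions
    ]
    boundaries = []
    for k, (pos, score) in enumerate(zip(positions, scores)):
        prev_score = scores[k - 1] if k > 0 else math.inf
        next_score = scores[k + 1] if k + 1 < len(scores) else math.inf
        # Strict on the left, non-strict on the right: a flat run of equal
        # minima yields a single boundary at its first position.
        if score < threshold and score < prev_score and score <= next_score:
            boundaries.append(pos)
    return boundaries


def segment_embeddings(embeddings, boundaries):
    """Mean-pool the residue embeddings within each predicted segment."""
    edges = [0] + list(boundaries) + [len(embeddings)]
    return [mean_vector(embeddings[a:b]) for a, b in zip(edges, edges[1:])]


# Toy stand-in for per-residue embeddings: two "segments" of five residues.
toy = [[1.0, 0.0]] * 5 + [[0.0, 1.0]] * 5
boundaries = segment_boundaries(toy)
pooled = segment_embeddings(toy, boundaries)
```

In this sketch no parameters are learned: the only inputs are the embeddings and two hand-set values, which mirrors the zero-shot setting described above. The mean-pooled vector per segment is one simple way a segment embedding could be formed for downstream categorization; whether the authors pool in this way is not stated in the abstract.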