Protein codes promote selective subcellular compartmentalization

Abstract

Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must assemble. In this study, we demonstrate that proteins with shared functions share amino acid sequence codes that guide them to compartment destinations. We developed a protein language model, ProtGPS, that predicts with high performance the compartment localization of human proteins excluded from the training set. ProtGPS successfully guided generation of novel protein sequences that selectively assemble in the nucleolus. ProtGPS identified pathological mutations that change this code and lead to altered subcellular localization of proteins. Our results indicate that protein sequences contain not only a folding code but also a previously unrecognized code governing their distribution to diverse subcellular compartments.

Article activity feed

  1. The area under the receiver operator curve (AUC-ROC) showed that protein compartments could be predicted with remarkable accuracy (0.83-0.95) across the 12 different compartments (Fig. 1D).

    ESM2 performance can be sensitive to the composition of its training data (e.g. https://www.biorxiv.org/content/10.1101/2024.03.07.584001v1.abstract). Specifically, class biases in the training data can be recapitulated in generated sequences.

    Given that AUC-ROC varies by compartment (Fig. 1D) and that the compartments are represented by very different numbers of input sequences (Fig. 1B), I wonder whether you examined possible biases in ProtGPS's behavior. Does ProtGPS generate sequences suited to certain compartments more readily than to others? If so, can this be explained by the statistical distribution of the training data?
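One simple way to probe the question above is to test whether per-compartment AUC-ROC tracks the number of training sequences per compartment, for example with a Spearman rank correlation. The following is a minimal sketch in plain Python, not the authors' analysis: the compartment names, counts, and AUC values are made-up placeholders, and the real values would have to be read from Fig. 1B and Fig. 1D.

```python
from statistics import mean


def ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the value at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# PLACEHOLDER data, invented for illustration only (not taken from the paper).
train_counts = {"compartment_a": 120, "compartment_b": 450,
                "compartment_c": 900, "compartment_d": 2000}
auc_roc = {"compartment_a": 0.84, "compartment_b": 0.88,
           "compartment_c": 0.86, "compartment_d": 0.93}

names = sorted(train_counts)
rho = spearman([train_counts[n] for n in names], [auc_roc[n] for n in names])
print(f"Spearman rho between training-set size and AUC-ROC: {rho:.2f}")
```

A strongly positive rho on the real values would suggest that performance partly reflects class size. The same function could compare the training class frequencies with how often generated sequences are assigned to each compartment, which speaks more directly to the generation question.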