Establishing comprehensive quaternary structural proteomes from genome sequence

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This study presents an important platform for mapping mutation effects onto higher-level protein structural information, addressing a significant gap in current research. While the work is ambitious and incorporates often-overlooked aspects of higher-order structure, the strength of the evidence supporting some results seems incomplete. The quaternary structure modeling appears to underestimate oligomeric proteins compared to previous studies, and the mutation analysis lacks crucial baseline information. Despite these limitations, the method has potential for broader applications and generalization to additional organisms, warranting further development and refinement.

Abstract

A critical body of knowledge has developed through advances in protein microscopy, protein-fold modeling, structural biology software, availability of sequenced bacterial genomes, large-scale mutation databases, and genome-scale models. Based on these recent advances, we develop a computational framework that: i) identifies the oligomeric structural proteome encoded by an organism’s genome from available structural resources; ii) maps multi-strain alleleomic variation, resulting in the structural proteome for a species; and iii) calculates the 3D orientation of proteins across subcellular compartments with residue-level precision. Using the platform, we: iv) compute the quaternary E. coli K-12 MG1655 structural proteome; v) use a dataset of 12,000 mutations to build Random Forest classifiers that can predict the severity of mutations; and, in combination with a genome-scale model that computes proteome allocation, vi) obtain the spatial allocation of the E. coli proteome. Thus, in conjunction with relevant datasets and increasingly accurate computational models, we can now annotate quaternary structural proteomes, at genome scale, to obtain a molecular-level understanding of whole-cell functions.

Article activity feed

  2. Reviewer #1 (Public review):

    Summary:

    This work presents a computational platform that integrates currently available experimental or precomputed datasets and/or state-of-the-art modeling methods to assemble a proteome structure from a given list of genes (representing the whole proteome of an organism, or some specific subset of interest). The main advancement is that the proteome structure contains not only tertiary structure information (such as that provided by precomputed AlphaFold-predicted proteomes) but also information about quaternary structure. Adding quaternary structure information to whole proteomes is a challenging problem (and the manuscript would benefit from a more comprehensive introduction section presenting these challenges). Importantly, this addition of quaternary structure information is likely to significantly improve any downstream modelling or prediction. This is because most proteins form either stable or transient complexes, and a significant proportion of proteins interact with cellular structures such as the different biological membranes. These interactions provide important context for interpreting residue-level information, for example the fitness/functional effects of point mutations.

    Strengths:

    The main strength of this work is that it approaches the question of protein quaternary structure in a comprehensive way: in addition to oligomeric state, it also includes membrane and cellular localization. It also demonstrates how to use and combine the available experimental data and precomputed models to achieve the same for any set of genes.

    Weaknesses:

    The feasibility of obtaining a similar dataset (of useful/informative size) for a more complex organism is not clear.

  3. Reviewer #2 (Public review):

    In this study, a methodology called QSPACE is developed and presented. It integrates structural information for a specific organism, here E. coli. The process entails the gathering of individual structures, including oligomeric information/stoichiometry, the incorporation of data on transmembrane regions, and the utilization of the resulting dataset for the analysis of mutation effects and the allocation of proteomes.

    This work aims high, setting an ambitious goal of modeling the quaternary structure of a proteome. The method could be applied to other organisms in the future and has value in that respect. At the same time, the work tries to cover (too?) much ground and some of the results/analyses don't measure up. There are indeed a number of shortcomings and/or inconsistencies in the results presented. The comments below will help improve the work and its usefulness.

    (1) It is described that "QSPACE then finds the 3D coordinate file (i.e. "structure") that best reflects the user-defined (input #2) multi-subunit protein assembly". What is meant by "best reflects"? What if two different structures with the same stoichiometry are available? Which one is picked?

    (2) There appears to be a significant under-estimation of oligomer formation: it is reported that "31% (1,334/4,309) of E. coli genes participate in 1,047 oligomeric complexes, 667 genes are annotated as monomers, and 2,308 genes are not included". However, it is generally observed that ~50% of E. coli genes form homo-oligomers (see PMID 10940245 or more recently 38325366), and adding hetero-oligomers on top of that should increase the fraction of oligomers further. In that respect, the estimate forming the basis of this work (31% of genes participating in oligomeric complexes) seems incorrect. It is unclear why the authors did not identify more proteins as adopting a quaternary structure. It is generally hard to grasp details of the dataset, for example, the simple statistic of how many genes participate in homo- versus hetero-oligomers. Such information is partially presented in panels 2c & 2d, but these panels are very small and hard to read (I would suggest removing the structures of the ABC transporters to make space to present this in more detail).

    (3) There are a number of misleading statements/overstatements that I encourage the authors to revise. For example (not exhaustive):
    "to our knowledge this result is the most advanced genome-scale structural representation of the E. coli proteome and de facto represents a major advancement in genome annotation."
    "angstrom-level subcellular compartmentalization" - Can we really talk about sub-atomic precision when even side chains can move by several angstroms?
    "we provide a global accounting of all functionally important regions" - "all" is not justified
    "Incorporated into genome-scale models that compute protein expression" - what does that mean? There are gene expression & protein abundance datasets, why is the "compute" necessary?
    "Likewise, sequence-based prediction software (e.g., DeepTMHMM49) and structure-based prediction software (e.g., OPM50) are agnostic to membrane orientation and can also generate erroneous results" - what does "erroneous results" mean in this context? Those tools are not supposed to predict orientation.

    (4) What was the benchmark used to estimate the accuracy of orientation assignments?

    (5) It is not clear why structural information is required to calculate the volume taken up by different proteins across the proteome. For each protein, the expression level (copy number) is expected to have a significant effect, but I'm unsure why oligomerization is considered key here. It will modulate the volume exclusion associated with interface contact areas, but isn't this negligible compared to other factors, in particular expression?

    (6) Models aiming at predicting deleterious effects of mutations typically use sequence conservation, but I do not see such information used in Figure 4. Assessing the added value of structural information should include such evolutionary information (residue-level sequence conservation) in the baseline.

    (7) The "proteome allocation" analysis is presented as an important result, but I did not find details of the equations used to conduct this analysis. I assume that "proteome allocation" is based solely on expression, and that "cell volume" uses structural information on top of it. There is a significant difference between "proteome allocation" and "cell volume" as reflected in the proteomaps shown in panels 4e & 4f, but there is no explanation for it. Are the proteins' identities the same in these two panels? Were only proteins counted or was RNA considered as well? Clarification is needed for RNA: for example, how were volumes calculated in structures containing RNA? The datasets used to derive these maps should also be provided so that the maps can be reproduced.

    (8) I did not see that the generated structures are available - they should be deposited in a permanent repository with a DOI.
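
The gene counts quoted in comment (2) can be checked with a few lines of arithmetic. The sketch below uses only the four numbers quoted from the manuscript; the variable names are illustrative and nothing here reproduces the authors' pipeline.

```python
# Gene counts as quoted in comment (2); variable names are illustrative only.
total_genes = 4309
oligomeric = 1334     # genes participating in oligomeric complexes
monomeric = 667       # genes annotated as monomers
not_included = 2308   # genes without an assignment

# The three categories should partition the genome.
assert oligomeric + monomeric + not_included == total_genes

annotated = oligomeric + monomeric
frac_of_all = oligomeric / total_genes
frac_of_annotated = oligomeric / annotated

print(f"oligomeric / all genes:       {frac_of_all:.1%}")
print(f"oligomeric / annotated genes: {frac_of_annotated:.1%}")
```

The quoted 31% uses all 4,309 genes as the denominator. Restricted to the 2,001 genes that received any assignment, the oligomeric share is about two-thirds. The review does not state which denominator the ~50% literature estimate uses, so this is a framing aid rather than a resolution of the discrepancy.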
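
Comment (6) asks for residue-level sequence conservation in the baseline. One common way to obtain such a feature is a per-column Shannon entropy over a multiple sequence alignment. The following is a minimal sketch under stated assumptions: the function name, the toy alignment, and the normalization by log2(20) are all hypothetical choices for illustration, not the authors' method or the reviewer's prescription.

```python
import math
from collections import Counter


def column_conservation(alignment):
    """Return one conservation score per alignment column.

    Score = 1 - H / log2(20), where H is the Shannon entropy of the
    residue frequencies in the column (gap characters '-' are ignored).
    Assumes the 20 standard amino acids; columns with only gaps score 0.
    """
    if not alignment:
        return []
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    max_entropy = math.log2(20)
    scores = []
    for i in range(length):
        residues = [seq[i] for seq in alignment if seq[i] != "-"]
        if not residues:
            scores.append(0.0)
            continue
        n = len(residues)
        counts = Counter(residues)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        scores.append(1.0 - entropy / max_entropy)
    return scores


# Toy alignment (hypothetical sequences, for illustration only).
toy_alignment = ["MKVLA", "MKILA", "MRVLG", "MKV-A"]
scores = column_conservation(toy_alignment)
for position, score in enumerate(scores, start=1):
    print(position, round(score, 3))
```

A score of this kind could be added as one more column next to the structural features, so that a classifier trained with and without the structural columns shows what structure adds beyond conservation.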