Defining hierarchical protein interaction networks from spectral analysis of bacterial proteomes

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    Since the inception of comparative genomics, mining phyletic patterns has been a powerful approach for the discovery of previously unknown biological interactions. The authors combine singular value decomposition of the phyletic pattern matrix with a random forest classifier to uncover potential protein-protein interactions. The work illustrates the utility of such methods, which are increasingly applied to computational biology problems such as predicting protein-protein interactions from genomic information.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

Abstract

Cellular behaviors emerge from layers of molecular interactions: proteins interact to form complexes, pathways, and phenotypes. We show that hierarchical networks of protein interactions can be defined from the statistical pattern of proteome variation measured across thousands of diverse bacteria and that these networks reflect the emergence of complex bacterial phenotypes. Our results are validated through gene-set enrichment analysis and comparison to existing experimentally derived databases. We demonstrate the biological utility of our approach by creating a model of motility in Pseudomonas aeruginosa and using it to identify a protein that affects pilus-mediated motility. Our method, SCALES (Spectral Correlation Analysis of Layered Evolutionary Signals), may be useful for interrogating genotype-phenotype relationships in bacteria.

Article activity feed

  2. Reviewer #1 (Public Review):

    The manuscript by Zaydman et al. proposes a spectral analysis of phylogenetic profiles that identifies signals of protein-protein interaction or association at different scales, from direct PPIs through pathways and phenotypes to phylogenetic relationships.

    The paper reports some potentially very interesting results:

    - Different scales are related to different (even if overlapping) windows in the spectrum of the phylogenetic profiles, with the most global scale (phylogeny) related to the largest singular values, and the most local scale (physical PPI) to much smaller singular values.

    - Using this observation, and the correlation of proteins (projections of groups of orthologs onto the SVD components) across windows in the spectrum, the authors are able to extract a hierarchy of protein networks, which are refined from a general phenotype (bacterial motility in the paper) into several pathways and complexes (e.g. chemotaxis, flagellum).

    - This allows proteins of unknown function to be associated with particular pathways or complexes; the paper shows a case of experimental validation for one new association.

    - Using a supplementary layer of supervised machine learning (trained on interacting and non-interacting protein pairs), they claim to obtain more precise results than some recent PPI networks reconstructed using amino-acid coevolution (Cong et al.).
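
The windowing idea in the list above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' SCALES implementation: the toy presence/absence matrix, its orientation (rows are ortholog groups, columns are genomes), the window boundaries, the choice of cosine similarity, and the function names `window_coords` and `spectral_similarity` are all hypothetical.

```python
import numpy as np


def window_coords(profiles, start, stop):
    """Coordinates of each ortholog (row) on SVD components start..stop-1."""
    u, s, _ = np.linalg.svd(profiles, full_matrices=False)
    # Scale left singular vectors by their singular values (assumption:
    # the source does not say whether scaled or unscaled vectors are used).
    return u[:, start:stop] * s[start:stop]


def spectral_similarity(profiles, start, stop):
    """Cosine similarity between orthologs inside one spectral window."""
    coords = window_coords(profiles, start, stop)
    norms = np.linalg.norm(coords, axis=1, keepdims=True)
    unit = coords / np.maximum(norms, 1e-12)  # guard against zero-length rows
    return unit @ unit.T


# Toy phyletic matrix (made up): 6 ortholog groups x 8 genomes,
# with genomes 0-3 standing for one clade and genomes 4-7 for another.
profiles = np.array([
    [1, 1, 1, 1, 0, 0, 0, 0],  # present only in the first clade
    [1, 1, 1, 1, 0, 0, 0, 0],  # identical profile to the row above
    [0, 0, 0, 0, 1, 1, 1, 1],  # present only in the second clade
    [1, 0, 1, 0, 1, 0, 1, 0],  # scattered across both clades
    [1, 0, 1, 0, 1, 0, 0, 0],  # scattered, similar to the row above
    [1, 1, 1, 1, 1, 1, 1, 1],  # present in every genome
], dtype=float)

top = spectral_similarity(profiles, 0, 2)    # largest singular values
lower = spectral_similarity(profiles, 2, 4)  # smaller singular values

print("orthologs 0 and 1, top window:  ", round(float(top[0, 1]), 3))
print("orthologs 0 and 2, top window:  ", round(float(top[0, 2]), 3))
print("orthologs 3 and 4, lower window:", round(float(lower[3, 4]), 3))
```

Cosine similarity is used here because it does not depend on the arbitrary sign of each singular vector. Summing `coords @ coords.T` over disjoint windows that cover the whole spectrum recovers `profiles @ profiles.T`, which is one way to see that each window carries its own share of the co-occurrence signal.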

    While these results seem to be highly interesting and, in some cases, potentially spectacular, the paper is very hard to read and to understand. It is written in a semi-technical jargon mixing spectral analysis, machine learning and information theory. Even having expertise in these fields, I had to continuously jump between the main text, the methods and the figures (including the supplementary figures, 86 pages in total) to follow the argumentation of the paper. The authors should make a serious effort to ensure that the main messages become more accessible.

  3. Reviewer #2 (Public Review):

    From its inception, comparative genomics has held the promise of predicting protein-protein interactions using the phyletic patterns of proteins. The current work represents another iteration in the long series of such attempts, and aims to apply increasingly popular machine learning methods to this classic problem. The authors start with the phyletic pattern matrix for orthologous proteins and perform singular value decomposition on it to obtain the successive SVD components. They observed that the higher-ranked SVD components were dominated by information from the phylogenetic relationships between organisms. However, a large amount of unaccounted variance was contained in the lower components, which they sought to query further for biologically relevant information such as indirect interactions and direct interactions (PPIs). They assembled benchmarks from known biological databases to assess the inferred interactions. These interactions were derived from the "spectral correlation", which they obtained from row correlations in the U and V matrices of the decomposition of their ortholog phyletic pattern matrix. Given that the correlations can be a mix of all kinds of signals, including phylogenetic, indirect and direct interactions, they used a gold-standard set of well-characterized E. coli K12 protein pairs to train random forest models for learning direct PPIs.
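
The first steps described above, assembling a phyletic pattern matrix and inspecting how its variance is spread across SVD components, can be sketched as follows. This is a hedged illustration only: the genome and ortholog-group names are made up, the orientation of the matrix is an assumption, `phyletic_matrix` and `variance_fractions` are hypothetical helper names, and the supervised random forest step is not shown.

```python
import numpy as np


def phyletic_matrix(genomes):
    """Build a presence/absence matrix: rows = ortholog groups, columns = genomes."""
    genome_ids = sorted(genomes)
    ortholog_ids = sorted(set().union(*genomes.values()))
    row_of = {og: i for i, og in enumerate(ortholog_ids)}
    matrix = np.zeros((len(ortholog_ids), len(genome_ids)))
    for col, genome in enumerate(genome_ids):
        for og in genomes[genome]:
            matrix[row_of[og], col] = 1.0  # 1 = ortholog group present
    return matrix, ortholog_ids, genome_ids


def variance_fractions(matrix):
    """Share of the total squared signal carried by each SVD component."""
    singular_values = np.linalg.svd(matrix, compute_uv=False)
    power = singular_values ** 2
    return power / power.sum()


# Hypothetical input: which ortholog groups each genome contains.
toy_genomes = {
    "genome_A1": {"OG1", "OG2", "OG3"},
    "genome_A2": {"OG1", "OG2", "OG4"},
    "genome_B1": {"OG1", "OG5", "OG6"},
    "genome_B2": {"OG1", "OG5"},
}

matrix, ortholog_ids, genome_ids = phyletic_matrix(toy_genomes)
fractions = variance_fractions(matrix)

for rank, share in enumerate(fractions, start=1):
    print(f"component {rank}: {share:.2%} of total squared signal")
```

On real data, the share left over after the leading components is the "unaccounted" variance that the lower-ranked components are then mined for.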

    The attractive aspects of this work include: 1) the use of a comprehensive phyletic pattern matrix for orthologs; 2) a reliable training set for the random forest method; 3) the assembly of multiple benchmarking sets with thorough benchmarking of the method; 4) the recovery of subsystems of bacterial flagellar motility and other systems.

    Weaknesses: 1) The bacterial tree is not uniformly sampled by genome sequencing. Certain lineages, e.g., Gammaproteobacteria and Terrabacteria (Bacillus group), are overrepresented in the starting matrix. This could bias the correlations obtained in the "mid-range" SVD components; 2) The biological inferences drawn about the role of the tested gene in twitching motility might be over-interpreted. Briefly, the authors recover 4 uncharacterized proteins (Q9I5G6, Q9I5R2, Q9I0G2, Q9I0G1) as part of their T4 pilus sub-graph and infer a general function for them in twitching motility. They chose Q9I5G6 because it was the only one with a supposed domain of unknown function (DUF4845). However, it should be noted that Q9I5R2 also contains such a domain, DUF805, along with a Zn-ribbon domain. Further, Q9I0G2 is a T2SS secretion platform protein and Q9I0G1 is the ATPase engine for the pilus. Genomic neighborhood analysis by this referee revealed that DUF4845 likely functions with the signal peptidase in secretion. Thus, given the role of the pilus in secretion and motility, the best one could infer is a role for DUF4845 in pilus function, perhaps with a greater intersection with secretion. This could even indirectly affect the motility function that the authors' experiments are said to support. However, the authors state in the abstract that they have uncovered a twitching motility effector. At best they could say they have uncovered a potential component that might be functionally linked to the T4 pilus and that might affect secretion or twitching motility. Indeed, the phyletic pattern of DUF4845 does not immediately suggest that all organisms with it also possess definitive twitching motility.

    While methods of this kind have the promise to serve biological functional inference, the actual example provided does not appear to be the strongest. That said, I do think the work presents a method that might have utility in computational inferences of function, especially if combined with other forms of information from comparative genomics.

  4. Reviewer #3 (Public Review):

    The authors describe a computational prediction framework aimed at connecting individual genes into progressively larger units of function: from protein complexes to higher-order pathways. The framework is based on the tracking of the presence and absence of orthologous genes across a large number of genomes; the authors' method is demonstrated to work well, albeit only for prokaryotic organisms. The basic evolutionary signal used by the authors has been described previously, and has been used previously to predict protein-protein associations, but the authors take it a step further by carefully deconstructing the signal into multiple components: a phylogenetic component, a direct protein-protein interaction component, and a more indirect association component. They then construct a hierarchical model of functional linkages, for any prokaryotic genome of interest. Finally, they use this to predict and experimentally verify the function of a previously uncharacterized protein in Pseudomonas aeruginosa.

    This is a well-written and carefully executed study, taking a known prediction technique to a new level. It has broad applicability, and should be of interest to a wide readership.