De Novo multi-omics pathway analysis (DMPA) designed for prior data independent inference of cell signaling pathways

Curation statements for this article:
  • Curated by eLife

    eLife assessment:

    This manuscript describes development of a new algorithm for integrative analysis of multi-omics data. This work should be of potential interest to scientists performing bioinformatic pathway discovery in multi-omic datasets especially those that relate to signaling.

This article has been Reviewed by the following groups

Abstract

New tools for cell signaling pathway inference from multi-omics data that are independent of previous knowledge are needed. Here we propose a new de novo method, the de novo multi-omics pathway analysis (DMPA), to model and combine omics data into regulatory complexes and pathways. DMPA was validated with publicly available omics data and was found to be accurate in discovering protein-protein interactions, kinase substrate phosphosite relationships, transcription factor target gene relationships, metabolic reactions, epigenetic trait associations and signaling pathways. DMPA was benchmarked against existing module and network discovery and multi-omics integration methods and outperformed previous methods in module and signaling pathway discovery, especially when applied to datasets with low sample sizes and zero-inflated data. Transcription factor, kinase, subcellular location and function prediction algorithms were devised for transcriptome, phosphoproteome and interactome regulatory complexes and pathways, respectively. To apply DMPA in a biologically relevant context, interactome, phosphoproteome, transcriptome and proteome data were collected from analyses carried out using melanoma cells to address gamma-secretase cleavage-dependent signaling characteristics of the receptor tyrosine kinase TYRO3. The pathways modeled with DMPA reflected both the predicted function and the direction of the predicted function in validation experiments.

Article activity feed

  1. eLife assessment:

    This manuscript describes development of a new algorithm for integrative analysis of multi-omics data. This work should be of potential interest to scientists performing bioinformatic pathway discovery in multi-omic datasets especially those that relate to signaling.

  2. Reviewer #1 (Public Review):

    Vaparanta et al propose a new bioinformatic algorithm for pathway discovery from multi-omics data sources at one time point, and validate some of their algorithm's predictions using functional experiments. The authors should be commended for their detailed experimental work and comprehensive data collection around TYRO3 signaling in melanoma, which will likely be of value to that field. They also provide a mature software package that is well documented for implementing their bioinformatic methods. The reviewer's experience with the software was that it is computationally efficient/fast with well written code. The biological data (both multiomics and functional validation studies) will be of interest to melanoma research as well as scientists interested in TYRO3 signaling.

    At this time, however, the bioinformatics algorithm proposed is of unclear utility to the broader multiomics community for the following reasons:

    First, the algorithm itself has numerous hyperparameters, which can make it challenging to use and potentially highly sensitive to these user inputs. Just the regulatory complex inference step has 10 hyperparameters/settings required to be selected.

    Second, the algorithm is presented in an ad hoc manner without mathematical/statistical justifications of the many design decisions and steps in the analysis. For example, the authors write "The inference of regulatory complexes from the combined score follows the nearest neighbor principle, assuming that while a single high combined score can be random chance, the combination of combined scores between 3 cell signaling molecules would be predictive". It is mathematically unclear that this is true, and thus this reviewer attempted to test the algorithm using simulated uncorrelated Gaussian noise (see code/outputs at end of the review) in 10K genes and 10 samples using a best attempt at hyperparameter selection per the code comments and documentation. It appears that nearly 1/3 of all genes (i.e., 3205 of 10K) were erroneously grouped into complexes (assuming no mistakes in reviewer's usage of the code). In general, "unbiased" pathway analysis in multiomics that is not relying on prior knowledge will require solving the extraordinarily challenging task of estimating a very large covariance matrix from statistically small sample sizes. This puts the method at high risk of producing spurious results.
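    The small-sample concern raised above can be illustrated with a minimal sketch. It is not the reviewer's test script and does not call DMPA; it only assumes NumPy, uses a hypothetical helper name (`noise_correlations`), and uses 500 features instead of 10K to keep the run short. It draws uncorrelated Gaussian noise with 10 samples per feature and counts how many feature pairs nevertheless show a strong Pearson correlation.

```python
import numpy as np


def noise_correlations(n_features=500, n_samples=10, seed=0):
    """Pairwise Pearson correlations between rows of pure Gaussian noise.

    Hypothetical helper for illustration only; it is not part of DMPA.
    """
    rng = np.random.default_rng(seed)
    # Each row is one simulated feature measured in n_samples samples.
    data = rng.standard_normal((n_features, n_samples))
    corr = np.corrcoef(data)  # rows are treated as variables
    # Keep each unordered pair once (strict upper triangle).
    return corr[np.triu_indices(n_features, k=1)]


pairs = noise_correlations()
n_pairs = pairs.size
# Fraction of pairs with |r| > 0.8 although no true association exists.
frac_strong = float(np.mean(np.abs(pairs) > 0.8))
print(f"{n_pairs} pairs, fraction with |r| > 0.8: {frac_strong:.4f}")
```

    Because the number of feature pairs grows quadratically with the number of features, even a small fraction of strong chance correlations corresponds to many pairs in absolute terms, which is the risk the reviewer describes.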

    Third, pathway analysis has long been a bioinformatic goal in the literature, with the authors citing a landmark paper for the WGCNA method from 2008. As such, there are numerous and long-standing discussions in the literature regarding challenges of pathway analysis (i.e., omics data often has dimensionality D far larger than sample size N, and correlation matrix estimation requires D^2 >> N parameters to be estimated) and its potential for spurious correlations. Some authors use sophisticated statistical tools (e.g., "Biological network inference using low order partial correlation" 2014, "Learning Large‐Scale Graphical Gaussian Models from Genomic Data" 2005, "Incorporating prior knowledge into Gene Network Study" 2013) to attempt to deal with this issue. Furthermore, the authors indicate that their approach is the first to attempt pathway analysis in multi-omics setting, stating "Integrative approaches combining more than one robust molecular association measure, however, have not been explored", but one can find attempts such as "An Integrative Transcriptomic and Metabolomic Study of Lung Function in Children With Asthma" to build on WGCNA for work in multiomics datasets. The 2020 review paper "Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources" seems to identify multiple published methods dealing with pathway estimation in multiomics datasets. As the paper stands, this reviewer cannot adequately assess the impact of the proposed bioinformatic algorithm and its results against the existing body of literature for pathway inference.

  3. Reviewer #2 (Public Review):

    The authors describe a bioinformatic platform that allows for unbiased pathway analysis from multiomics data. The concept is based on correlation, stoichiometry scores and their combination to evidence interaction between two proteins, transcripts or phosphosites in an omic dataset. This platform was developed and validated on both previously published and in house omics data. I really appreciate that the paper is well written and clear, and I would like to acknowledge the amount of work generated to produce the in-house dataset.

  4. Author Response:

    Reviewer #1 (Public Review):

    Vaparanta et al propose a new bioinformatic algorithm for pathway discovery from multi-omics data sources at one time point, and validate some of their algorithm's predictions using functional experiments. The authors should be commended for their detailed experimental work and comprehensive data collection around TYRO3 signaling in melanoma, which will likely be of value to that field. They also provide a mature software package that is well documented for implementing their bioinformatic methods. The reviewer's experience with the software was that it is computationally efficient/fast with well written code. The biological data (both multiomics and functional validation studies) will be of interest to melanoma research as well as scientists interested in TYRO3 signaling.

    The authors wish to thank the Reviewer for the positive comments.

    At this time, however, the bioinformatics algorithm proposed is of unclear utility to the broader multiomics community for the following reasons:

    First, the algorithm itself has numerous hyperparameters, which can make it challenging to use and potentially highly sensitive to these user inputs. Just the regulatory complex inference step has 10 hyperparameters/settings required to be selected.

    We have now reduced the number of parameters in the code by automating the choice of 2 of the parameters. The manuscript is now accompanied by a sensitivity analysis of all the key parameters in the code (new Supplementary Figures 5-11), and we have created a script to inform the choice of the key parameter S (suggest parameter S value for regulatory complex inference, new Supplementary Figure 10). We have additionally thoroughly revised the accompanying documentation to help the user choose the right settings for their datasets (available in Mendeley data: https://data.mendeley.com/datasets/m3zggn6xx9/draft?a=71c29dac-714e-497e-8109-5c324ac43ac3).

    Second, the algorithm is presented in an ad hoc manner without mathematical/statistical justifications of the many design decisions and steps in the analysis. For example, the authors write "The inference of regulatory complexes from the combined score follows the nearest neighbor principle, assuming that while a single high combined score can be random chance, the combination of combined scores between 3 cell signaling molecules would be predictive". It is mathematically unclear that this is true…

    We have now tested the effect of the design decisions of the algorithm on the ability to discover known associations in omics datasets (new Supplementary Figure 4). Adhering to the design decisions of the algorithm greatly increases the number of known associations found in real omics data.

    …and thus this reviewer attempted to test the algorithm using simulated uncorrelated Gaussian noise (see code/outputs at end of the review) in 10K genes and 10 samples using a best attempt at hyperparameter selection per the code comments and documentation. It appears that nearly 1/3 of all genes (i.e., 3205 of 10K) were erroneously grouped into complexes (assuming no mistakes in reviewer's usage of the code). In general, "unbiased" pathway analysis in multiomics that is not relying on prior knowledge will require solving the extraordinarily challenging task of estimating a very large covariance matrix from statistically small sample sizes. This puts the method at high risk of producing spurious results.

    The Reviewer raises an important topic that should be considered in de novo analyses. However, the test dataset the Reviewer used is not truly representative of the omics datasets that should be used to evaluate the performance of the algorithm. First, the algorithm should only be used with positive expression values because of the way the stoichiometry score is calculated. This is now more clearly indicated in the accompanying documentation (available in Mendeley data: https://data.mendeley.com/datasets/m3zggn6xx9/draft?a=71c29dac-714e-497e-8109-5c324ac43ac3). The Gaussian noise used by the Reviewer contains negative values and therefore does not represent the positive expression values of omics datasets.

    Second, the algorithm is constructed so that it will try to find an association for every feature in the dataset if so instructed by the parameters. To this end, we have now added a new parameter (parameter S) to the algorithm to better control this behavior. If it is set correctly for the test dataset used by the Reviewer, the algorithm now returns 0 complexes. The authors also wish to point out that they strongly believe that the number of features that have no real association with other features in real omics data is very low, since most intracellular molecules have common upstream regulators. This poses a problem only if the dataset has a very limited number of features.

    Third, it seems to the authors that instead of testing the limits of the algorithm with totally randomized data, it would be more valuable to assess whether the algorithm can find true positives among randomized data. To this end, we estimated the true positive and false positive rates with normally, negative binomially and beta distributed simulated data (new Supplementary Figures 7-9). Indeed, the algorithm discovers only the true positives as long as the S parameter is not set too low. We now provide a separate script (suggest parameter S value for regulatory complex inference, new Supplementary Figure 10) that helps the user choose the parameter S for their data so that the number of false positives in the inference is minimized.
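    The kind of evaluation described here, planting true associations among random features and then measuring true and false positive rates, can be sketched as follows. This is only an illustration under stated assumptions: it does not run DMPA, a plain Pearson correlation threshold stands in for the algorithm's score and is not the parameter S, and the function name `simulate_rates` and all default values are hypothetical.

```python
import numpy as np


def simulate_rates(n_noise=200, n_planted_pairs=20, n_samples=10,
                   threshold=0.9, seed=0):
    """Estimate (TPR, FPR) of a correlation threshold on simulated data.

    Hypothetical stand-in for illustration; not the DMPA scoring scheme.
    """
    rng = np.random.default_rng(seed)
    # Planted true positives: each partner is its base feature plus small noise.
    base = rng.standard_normal((n_planted_pairs, n_samples))
    partner = base + 0.1 * rng.standard_normal((n_planted_pairs, n_samples))
    # Unassociated features: independent Gaussian noise.
    noise = rng.standard_normal((n_noise, n_samples))
    data = np.vstack([base, partner, noise])

    corr = np.corrcoef(data)
    n = data.shape[0]
    # Ground truth: row i is associated with row i + n_planted_pairs.
    truth = np.zeros((n, n), dtype=bool)
    idx = np.arange(n_planted_pairs)
    truth[idx, idx + n_planted_pairs] = True

    upper = np.triu_indices(n, k=1)  # each unordered pair once
    called = corr[upper] > threshold
    is_true = truth[upper]
    tpr = float(called[is_true].mean())
    fpr = float(called[~is_true].mean())
    return tpr, fpr


tpr_strict, fpr_strict = simulate_rates(threshold=0.9)
tpr_loose, fpr_loose = simulate_rates(threshold=0.5)
print(f"threshold 0.9: TPR={tpr_strict:.2f}, FPR={fpr_strict:.4f}")
print(f"threshold 0.5: TPR={tpr_loose:.2f}, FPR={fpr_loose:.4f}")
```

    With the same seed both calls score the same simulated data, so loosening the threshold can only keep or raise the false positive rate, which is the general trade-off a user faces when choosing a cutoff.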

    Fourth, the data produced by the standard normal distribution have a relatively low variance: 68% of values fall between -1 and 1, and 95% of values fall between -2 and 2. If one simulates 10,000 random rows with a sample size of 10 from such a low-variance distribution, there is a high chance of creating highly correlated rows that would actually be representative of true positives in the dataset, given the generally high variation within omics data. Therefore, it is exceedingly hard to interpret whether the features were erroneously assigned into complexes or not, because the chosen simulation method could have by chance created associations that represent true positives in the dataset.
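    The 68% and 95% figures quoted for the standard normal distribution can be checked with a short calculation. The sketch below uses only the Python standard library and the identity P(|Z| < k) = erf(k / sqrt(2)) for Z ~ N(0, 1); the helper name is hypothetical.

```python
import math


def standard_normal_coverage(k):
    """Probability that a standard normal value lies in (-k, k)."""
    return math.erf(k / math.sqrt(2.0))


within_one = standard_normal_coverage(1.0)  # roughly 0.68
within_two = standard_normal_coverage(2.0)  # roughly 0.95
print(f"P(|Z| < 1) = {within_one:.4f}, P(|Z| < 2) = {within_two:.4f}")
```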

    Fifth, we also analyzed the standard normal distributed simulated data with WGCNA, which is still the most widely used module discovery method. WGCNA assigned almost all the features into modules. However, the authors think that the wide use of WGCNA makes it clear that the analysis can still offer valuable insight into biological processes. Therefore, the authors are not sure how concerned they should be about the results of this test.

    Third, pathway analysis has long been a bioinformatic goal in the literature, with the authors citing a landmark paper for the WGCNA method from 2008. As such, there are numerous and long-standing discussions in the literature regarding challenges of pathway analysis (i.e., omics data often has dimensionality D far larger than sample size N, and correlation matrix estimation requires D^2 >> N parameters to be estimated) and its potential for spurious correlations. Some authors use sophisticated statistical tools (e.g., "Biological network inference using low order partial correlation" 2014, "Learning Large‐Scale Graphical Gaussian Models from Genomic Data" 2005, "Incorporating prior knowledge into Gene Network Study" 2013) to attempt to deal with this issue.

    The authors agree that if by spurious the Reviewer means non-causal indirect associations, as in the paper by Zuo et al. (Zuo et al., 2014. Biological network inference using low order partial correlation. Methods 69:266-73. doi: 10.1016/j.ymeth.2014.06.010.), then, indeed, the algorithm has not been designed to find directed networks. Instead, the algorithm has been designed to find common upstream regulators.

    Furthermore, the authors indicate that their approach is the first to attempt pathway analysis in multi-omics setting, stating "Integrative approaches combining more than one robust molecular association measure, however, have not been explored", but one can find attempts such as "An Integrative Transcriptomic and Metabolomic Study of Lung Function in Children With Asthma" to build on WGCNA for work in multiomics datasets.

    Indeed, the Reviewer is correct that correlation networks and WGCNA have been previously used with multi-omics datasets. What the authors meant to convey is that these previous approaches rely on only one measure of molecular association, which is correlation in the case of correlation networks and covariation in the case of WGCNA, while our method is the first that combines two measures of molecular association, the correlation and the stoichiometry score. We have now amended the sentence in the manuscript (lines 51-52).

    The 2020 review paper "Metabolomics and Multi-Omics Integration: A Survey of Computational Methods and Resources" seems to identify multiple published methods dealing with pathway estimation in multiomics datasets. As the paper stands, this reviewer cannot adequately assess the impact of the proposed bioinformatic algorithm and its results against the existing body of literature for pathway inference.

    We have now benchmarked our method against existing module discovery, network and multi-omics integration methods and provide evidence that our method outperforms these methods (new Figure 4).

    Reviewer #2 (Public Review):

    The authors describe a bioinformatic platform that allows for unbiased pathway analysis from multiomics data. The concept is based on correlation, stoichiometry scores and their combination to evidence interaction between two proteins, transcripts or phosphosites in an omic dataset. This platform was developed and validated on both previously published and in house omics data. I really appreciate that the paper is well written and clear, and I would like to acknowledge the amount of work generated to produce the in-house dataset.

    The authors wish to thank the Reviewer for the encouraging words.