Unified knowledge-driven network inference from omics data


Abstract

Analysing omics data requires computational methods that can handle its complexity and derive meaningful hypotheses about molecular mechanisms. While data-driven statistical and machine learning methods can identify patterns in omics data across multiple samples, they typically require a large number of samples and often lack interpretability and alignment with existing biological knowledge. In contrast, knowledge-based network methods integrate molecular data with prior knowledge to provide biologically interpretable results, but they lack two things: a unified mathematical framework, which leads to ad-hoc solutions specific to particular data types or prior knowledge and limits their generalisability; and a common modelling interface for programmatic manipulation, which restricts method extensions. Furthermore, existing methods generally cannot perform joint network inference across multiple samples or conditions, which restricts their capacity to capture shared mechanisms and makes them more sensitive to noise and prone to overfitting. To address these limitations, we introduce CORNETO (Constrained Optimisation for the Recovery of NETworks from Omics), a unified framework for knowledge-driven network inference. CORNETO redefines the joint inference task as a constrained optimisation problem with a penalty that induces structured sparsity, allowing for simultaneous network inference across multiple samples. The framework is highly flexible and supports a wide variety of prior knowledge networks (undirected, directed and signed graphs, as well as hypergraphs), enabling the generalisation and improvement of many network inference methods despite their seemingly different assumptions. We demonstrate its utility by presenting novel extensions of methods for signalling, metabolism and protein-protein interactions. We show how these new methods improve the performance of traditional techniques on a diverse set of biological tasks using simulated and real data. CORNETO is available as an open-source Python package ( github.com/saezlab/corneto ), helping researchers extend, reuse, and harmonise methods for network inference.
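
As a rough illustration only, joint inference with a structured-sparsity penalty can be written in the generic form below. This is a sketch under stated assumptions, not necessarily the formulation used by CORNETO: the symbols $f_i$ (per-sample fitting loss), $\mathcal{C}_i$ (per-sample network constraints), $x^{(i)}$ (edge variables for sample $i$), $y_e$ (shared edge indicator), $\lambda$ and the big-$M$ constant are all hypothetical names introduced here.

```latex
\min_{x^{(1)},\dots,x^{(n)},\,y}\;
  \sum_{i=1}^{n} f_i\bigl(x^{(i)}\bigr) \;+\; \lambda \sum_{e \in E} y_e
\quad \text{s.t.} \quad
  x^{(i)} \in \mathcal{C}_i, \qquad
  \lvert x^{(i)}_e \rvert \le M\, y_e, \qquad
  y_e \in \{0,1\}
  \quad \forall\, i \in \{1,\dots,n\},\; e \in E .
```

In this sketch an edge $e$ is paid for once (through $y_e$) no matter how many samples use it, which is one way a penalty can favour networks that share edges across samples.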

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/14291902.

    This was reviewed by my lab group. Overall we really liked this paper. We have computational biology experience but not network biology experience.

    • We had some difficulty in tracking which algorithm or network solving case is currently considered across all the figures

    • It was unclear how the input and output nodes are selected in all cases, as pictured in Figure 2, and why they would differ across samples as shown in that figure

    • Supplementary Figure 1 seems to be redundantly named as "Figure 1"

    • How are termini decided throughout each example? Some cases stated this explicitly, especially Figure 6, but in many other cases it was unclear

    • What is the value in Figure 4a?

      • use gene names instead of SGD codes

      • if they are trying to show that the conditions they sample span the diversity of responses, that should be stated in the text

    • Figure 4b: I don't have good intuition about what these M/A values mean. Can they show z-scores instead?

      • callouts for fig 4b are also missing in the legend and text

      • should use gene names instead of SGD codes

    • Figure 4c: reporting the improvement in precision and recall seems to hide the absolute values of precision and recall; the actual metric values should be reported. For example, a 30% improvement from 10% precision to 13% may not be meaningful

    • It is not clear how the 20% hold-out experiment was performed. Are they saying that high-abundance enzymes included in the selected network were counted as true positives, that low-expressed enzymes excluded from it were counted as true negatives, and so on?

    • The metabolic interpretation of Figure 4d is lacking detail, for example the claim that "Most of these genes seem to have an indirect connection with the metabolism of fatty acids." ACO1 and PFK2, for instance, are essential glycolysis proteins

    • Show more directly the nature of the true positives and false positives from those held-out 20% sets

    • Figure 4: We are confused about how the model is set up mathematically here. It takes quantities of enzymes in these pathways as input and uses a PKN with reactions; what, for example, are the input and output nodes?

    • The order of the inset vs heatmap in figure 5a doesn't match

    • Figure 5: We believe that the PKN contributes the network shape, but it's unclear to us if the signs of the connections are derived from the network as suggested by panel e

    • Fig 6 in general seems very preliminary, with very little explanation; for the rest of the paper they were more systematic with comparisons to other tools

    • Figure 6: It is unclear why FragPipe needed to be brought up and why they didn't use the site-level quantities already provided from the paper. State why they needed to re-analyze the raw proteomics data

    • Figure 6 panels are explained out of order which is confusing. (a>b>e>d>c).

    • Figure 6e is missing a sense of what novel information is gained and how it may be different from PHONEMES.

    • Fig 6d, "We selected the most variable proteins as terminal nodes": state why this is a reasonable approach. We assume that these may be influenced by multiple factors.

    • Figure 6: unclear if all these nodes represent transcripts because only some nodes are labeled as genes

    • Figure 1 – Revise typo in diagram: Import Prior Knowledge Network

    • Figure 3A and 3B – Adjust the y-axis up to 200 or add a title with a description like the figure legend

    • During optimization, how are inverse relationships handled, such as down-regulated proteins paired with increased mRNA levels (or vice versa)? Could this be a source of false negatives/positives?

    • Compared to the shortest path supplement example, it's unclear how node values would be used in this case. Are they multiplied by edge weights?
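
To make the Figure 4c point above concrete, here is a minimal sketch of how a relative improvement can look large while the absolute metric stays low. The confusion counts are invented for illustration and are not taken from the paper; the helper names `precision_recall` and `relative_improvement` are our own.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def relative_improvement(baseline, new):
    """Fractional change of `new` with respect to `baseline`."""
    return (new - baseline) / baseline


# Hypothetical counts (assumption): a baseline method and an improved method,
# each selecting 100 enzymes, evaluated against 50 held-out positives.
base_p, base_r = precision_recall(tp=10, fp=90, fn=40)
new_p, new_r = precision_recall(tp=13, fp=87, fn=37)

rel_gain = relative_improvement(base_p, new_p)  # relative change in precision
abs_gain = new_p - base_p                       # absolute change in precision

print(f"precision: {base_p:.2f} -> {new_p:.2f}")
print(f"relative improvement: {rel_gain:.0%}, absolute improvement: {abs_gain:.2f}")
```

Reporting `base_p` and `new_p` alongside the relative gain lets a reader judge whether the improvement matters in practice.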

    Competing interests

    The authors declare that they have no competing interests.