The probability of edge existence due to node degree: a baseline for network-based predictions

Abstract

Important tasks in biomedical discovery such as predicting gene functions, gene–disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network’s specific connections using network permutation to generate features that depend only on degree. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Researchers seeking to predict new or missing edges in biological networks should use our permutation approach to obtain a baseline for performance that may be nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).
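The permutation idea in the abstract can be illustrated with a minimal sketch. It is not the API of the xswap package; the names `xswap_permute`, `edge_prior`, and `example_edges` are hypothetical, and the toy network is invented for illustration. The sketch assumes a bipartite edge list with no duplicate edges and at least two edges. It repeatedly swaps the targets of two randomly chosen edges, which keeps every node's degree unchanged, and then estimates a degree-only edge probability for each node pair as the fraction of permuted networks in which that pair is connected.

```python
import random
from collections import Counter


def xswap_permute(edges, n_swaps, rng):
    """Return a degree-preserving permutation of a bipartite edge list.

    Each attempted swap picks two edges (a, b) and (c, d) and proposes
    (a, d) and (c, b). Proposals that would duplicate an existing edge
    are rejected, so the result stays a simple graph.
    """
    edges = list(edges)
    present = set(edges)
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        new1, new2 = (a, d), (c, b)
        if new1 in present or new2 in present:
            continue  # rejected: the swap would create a duplicate edge
        present.difference_update([(a, b), (c, d)])
        present.update([new1, new2])
        edges[i], edges[j] = new1, new2
    return edges


def edge_prior(edges, n_perm=200, swaps_per_edge=10, seed=0):
    """Estimate, per node pair, the fraction of permuted networks with that edge."""
    rng = random.Random(seed)
    counts = Counter()
    current = list(edges)
    for _ in range(n_perm):
        # Keep permuting the previous permutation; degrees never change.
        current = xswap_permute(current, swaps_per_edge * len(current), rng)
        counts.update(current)
    # Pairs never observed in any permutation are omitted (estimate of 0).
    return {pair: count / n_perm for pair, count in counts.items()}


# Toy gene-disease network (invented): g1 has degree 3, d3 has degree 1.
example_edges = [
    ("g1", "d1"), ("g1", "d2"), ("g1", "d3"),
    ("g2", "d1"),
    ("g3", "d1"), ("g3", "d2"),
]
prior = edge_prior(example_edges)
```

Because every permuted network preserves degrees, the estimates for a given source node sum to that node's degree. This sketch keeps one estimate per node pair for simplicity; pooling pairs that share the same source and target degrees, as the "Degree-grouping" section named in the reviews suggests, would be a natural refinement.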

Article activity feed

  1. Abstract: Important tasks in biomedical discovery such as predicting gene functions, gene-disease associations, and drug repurposing opportunities are often framed as network edge prediction. The number of edges connecting to a node, termed degree, can vary greatly across nodes in real biomedical networks, and the distribution of degrees varies between networks. If degree strongly influences edge prediction, then imbalance or bias in the distribution of degrees could lead to nonspecific or misleading predictions. We introduce a network permutation framework to quantify the effects of node degree on edge prediction. Our framework decomposes performance into the proportions attributable to degree and the network’s specific connections. We discover that performance attributable to factors other than degree is often only a small portion of overall performance. Degree’s predictive performance diminishes when the networks used for training and testing, despite measuring the same biological relationships, were generated using distinct techniques and hence have large differences in degree distribution. We introduce the permutation-derived edge prior as the probability that an edge exists based only on degree. The edge prior shows excellent discrimination and calibration for 20 biomedical networks (16 bipartite, 3 undirected, 1 directed), with AUROCs frequently exceeding 0.85. Researchers seeking to predict new or missing edges in biological networks should use the edge prior as a baseline to identify the fraction of performance that is nonspecific because of degree. We released our methods as an open-source Python package (https://github.com/hetio/xswap/).

    This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae001), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Linlin Zhuo

    In this manuscript, the authors introduce a network permutation framework to quantify the effects of node degree on edge prediction. The importance of degree in the edge detection task is self-evident, and the quantification of this effect is undoubtedly groundbreaking. The experimental results on a variety of datasets demonstrate the advanced nature of the method proposed by the authors. However, some parts require further explanation from the authors before the manuscript can be considered for acceptance at a later stage.

    1. The imbalance of the degree distribution has a significant impact on the results of the edge detection task. In this manuscript, the authors propose a framework to quantify this impact. However, the manuscript does not explicitly state the form the quantification takes, such as whether it is presented as an indicator or in another form. Further explanation from the authors is needed to clarify this aspect.

    2. The authors propose that researchers employ marginal priors as a reference point to separate the contributions attributed to node degree from those arising from specific performance. It would be helpful if the authors could elaborate further on the methodology or provide a sample demonstration to clarify the implementation of this approach.

    3. For the XSwap algorithm, I wonder whether the authors could provide a more detailed explanation of its workings, including a step-by-step implementation of the improved XSwap. Furthermore, it would be beneficial if the authors could highlight the significance of the improved XSwap algorithm in biomedical tasks.

    4. The authors present the pseudocode of the XSwap algorithm in Figure 2, along with the improved pseudocode after the authors' enhancements. Both pseudocodes are accompanied by explanatory text. However, I believe that expressing them in the form of a figure would make them more visually appealing and intuitive.

    5. The authors introduce the edge prior to quantify the probability of two nodes being connected solely based on their degree. I request that the authors provide a detailed explanation of the specific implementation of the edge prior.

    6. In the "Prediction tasks" section, the authors use three prediction tasks to assess the performance of the edge prior. It is recommended to segment this section properly so that the content is displayed more clearly.

    7. The focus of the article might not be prominent enough. It is advisable for the authors to elaborate further on the advanced nature of the proposed framework and its significance in practical tasks. This would help emphasize the main contributions of the research and its relevance in real-world applications.

  2. This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giae001), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Babita Pandey

    The manuscript "The probability of edge existence due to node degree: a baseline for network-based predictions" presents novel work, but some sections are written so briefly that they are difficult to understand. The sections that need revision are: "Degree-grouping", "The edge prior encapsulates degree", "Degree can underly a large fraction of performance", and "Analytical approximation of the edge prior". The results section also needs revision.

    Some other concerns are: Adamic/Adar, Jaccard coefficient, preferential attachment, etc. are link prediction methods. Why has the author termed them edge prediction features?