EssSubgraph improves performance and generalizability of mammalian essential gene prediction with large networks
Abstract
Predicting essential genes is important for understanding the minimal genetic requirements of organisms, identifying disease-associated genes, and discovering potential drug targets. Wet-lab experiments for identifying essential genes are time-consuming and labor-intensive. Although various machine learning methods have been developed for essential gene prediction, both systematic testing with large collections of gene knockout data and rigorous benchmarking for efficient methods are very limited to date. Furthermore, current graph-based approaches require learning the entire gene interaction networks, leading to high computational costs, especially for large-scale networks. To address these issues, we propose EssSubgraph, an inductive representation learning method that integrates graph-structured network data with omics features for training graph neural networks. We used comprehensive lists of human essential genes distilled from the latest collection of knockout datasets for benchmarking. When applied to essential gene prediction with multiple types of biological networks, EssSubgraph achieved superior performance compared to existing graph-based and other models. The performance is more stable than other methods with respect to network structure and gene feature perturbations. Because of its inductive nature, EssSubgraph also enables predicting gene functions using dynamical networks with unseen nodes and it is scalable with respect to network sizes. Finally, EssSubgraph has better performance in cross-species essential gene prediction compared to other methods. Our results show that EssSubgraph effectively combines networks and omics data for accurate essential gene identification while maintaining computational efficiency. The source code and datasets used in this study are freely available at https://github.com/wenmm/EssSubgraph .
Article activity feed
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf136), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 2: Ju Xiang
This paper proposes EssSubgraph, an inductive graph neural network model for predicting mammalian essential genes by integrating protein-protein interaction (PPI) networks with multi-omics data. Experimental results demonstrate the performance of the method, with additional validation showing effective cross-species prediction and biological consistency of the predicted essential genes through functional enrichment analysis. This work is interesting, but some questions need to be clarified before publication.
(1) The literature review lacks a discussion of inductive vs. transductive graph learning approaches. Expanding this background would better contextualize the model's technical contributions.
(2) While the PCA dimensions for expression features were optimized (Figure 2A-B), other key hyperparameters such as sampling depth (K-hop) deserve similar systematic evaluation to ensure optimal configuration.
(3) What is RuLu? How do the authors handle sample imbalance? Does CONCAT mean that two vectors are joined end-to-end to form a single vector? If so, does it mean that the number of rows of W is set to 1 in order to generate the final prediction output?
(4) How is node sampling performed in EssSubgraph? The explanation of 'Subgraph' in the method name is not sufficient.
(5) What are 'edge perturbation' and 'feature perturbation'? How are they performed? What is the performance of the algorithm when only the network structure or only the gene expression data is used? In other words, does adding gene expression data on top of the network improve performance, and vice versa?
(6) The computational efficiency analysis focuses on memory usage but omits critical metrics such as training time and scalability with respect to batch size or sampling strategy. Is it appropriate to directly compare 'Memory efficiency and network scalability'? The same method may require different amounts of memory and computation time when using different encoding technologies.
(7) Minor revisions:
-- "and can predict identities of genes which can then predict the identities of genes that were either included in the training network or are unseen nodes."
-- Lines 244-251, "We used the EssSubgraph model mentioned above." The logical flow here needs improvement.
-- "The model is an inductive deep learning method that generates low-dimensional vector representations for nodes in graphs and can predict identities of genes which can then predict the identities of genes that were either included in the training network or are unseen nodes." This sentence is not clear.
-- Please supplement statistical data on 'high density'. Existing networks generally would not be called high-density.
-- Placing the perturbation curves of the different methods in the same figure would make it easier to compare their stability.
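For readers unfamiliar with the term raised in point (5), the following is a minimal sketch of one common form of edge perturbation: a chosen fraction of edges is removed and replaced with the same number of random node pairs that were not edges in the original graph. It is an illustrative assumption, not the procedure used in the manuscript; the function name `perturb_edges` and its parameters are hypothetical.

```python
import random


def perturb_edges(edges, num_nodes, fraction, seed=0):
    """Replace a fraction of undirected edges with random non-edges.

    Hypothetical helper for illustration only; the manuscript's own
    perturbation protocol may differ.
    """
    rng = random.Random(seed)
    # Store each undirected edge as a sorted tuple so (u, v) == (v, u).
    original = {tuple(sorted(e)) for e in edges}
    n_swap = int(round(fraction * len(original)))

    # Guard against asking for more new edges than the graph can hold.
    max_new = num_nodes * (num_nodes - 1) // 2 - len(original)
    if n_swap > max_new:
        raise ValueError("not enough non-edges available to rewire")

    # Remove n_swap randomly chosen original edges.
    removed = set(rng.sample(sorted(original), n_swap))
    kept = original - removed

    # Add n_swap random node pairs that were not edges originally.
    added = set()
    while len(added) < n_swap:
        u, v = rng.sample(range(num_nodes), 2)
        pair = (min(u, v), max(u, v))
        if pair not in original:
            added.add(pair)

    return sorted(kept | added)


edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
perturbed = perturb_edges(edges, num_nodes=6, fraction=0.5, seed=1)
print(perturbed)
```

Under this scheme the edge count stays constant, so a drop in accuracy as `fraction` grows reflects the loss of true network structure and not a change in graph density.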
Reviewer 1: Yuchi Qiu
Predicting essential genes is critical for identifying disease-associated genes. In this work, the authors propose EssSubgraph to predict essential genes by combining PPI and transcriptome data. EssSubgraph utilizes a GraphSAGE structure with subgraph sampling techniques to produce accurate, efficient, and scalable predictions. The method was tested and compared with multiple GNN-based models on 1) essential gene prediction and 2) prediction with randomly permuted node and edge features, and EssSubgraph shows superior performance in accuracy, efficiency, and scalability. The authors also performed GO analysis to show the interpretability of EssSubgraph in picking up genes with critical biological functions. Further analysis in predicting unseen genes and cross-species genes exemplified its strong generalizability. Overall, this work developed a novel and advanced GNN-based model with comprehensive studies. However, some clarifications are necessary to improve the paper's readability.
- The authors may give an overview of the method's motivation. For example, the authors may describe the DepMap method and its limitations, then use this as motivation to explain why EssSubgraph is better. Essential genes appear to be very context-specific; could the authors clarify what information is used to define essential genes?
- The authors may introduce their method's unique features, such as graph sampling, and its modifications to GraphSAGE.
- The GNN model description of EssSubgraph is not clear enough. What kind of graph aggregation is used? Is the aggregation layer coupled with a residual layer, and how many layers are used? What is the structure after all aggregation layers? I recommend creating an illustration of the network architecture showing all these details.
- Many PPI networks are cell-type- or species-specific. How was this cell-type and species information used in this work?
- Lines 150-152: clarification needed.
- Line 222: should "learned linear transformation" be "learnable linear layer"?
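As background for the questions above about aggregation and sampling, the following is a minimal sketch of a generic GraphSAGE-style layer with fixed-size neighbor sampling, a mean aggregator, and the CONCAT step. It assumes the standard GraphSAGE formulation and is not the EssSubgraph implementation; all function names, the toy graph, and the weight matrix are hypothetical.

```python
import random


def sample_neighbors(adj, node, k, rng):
    """Draw k neighbors; sample with replacement if fewer than k exist."""
    nbrs = adj[node]
    if not nbrs:
        return [node]  # isolated node: fall back to the node itself
    if len(nbrs) >= k:
        return rng.sample(nbrs, k)
    return [rng.choice(nbrs) for _ in range(k)]


def mean_vector(vectors):
    """Element-wise mean of equal-length feature vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sage_layer(adj, feats, weight, k, rng):
    """h_v' = ReLU(W . CONCAT(h_v, mean of sampled neighbor features))."""
    out = {}
    for v in adj:
        sampled = sample_neighbors(adj, v, k, rng)
        agg = mean_vector([feats[u] for u in sampled])
        concat = feats[v] + agg  # end-to-end concatenation, length 2*d
        # Each row of W yields one output dimension, followed by ReLU.
        out[v] = [max(0.0, sum(w * x for w, x in zip(row, concat)))
                  for row in weight]
    return out


# Toy graph: node 0 is linked to nodes 1 and 2; features have d = 2.
adj = {0: [1, 2], 1: [0], 2: [0]}
feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [0.0, 3.0]}
weight = [[1, 0, 0, 1]]  # shape (1, 2*d): a single output per node
hidden = sage_layer(adj, feats, weight, k=2, rng=random.Random(0))
print(hidden[0])  # [3.0]
```

In this sketch the number of rows of `weight` sets the output dimension, so a single row yields one score per node. Because the layer only reads a node's own features and those of its sampled neighbors, the same weights can be applied to nodes that were absent during training, which is what makes this family of models inductive.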
