Unveiling Gene Perturbation Effects through Gene Regulatory Networks Inference from single-cell transcriptomic data
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (Review Commons)
Abstract
Physiological and pathological processes are governed by networks of interacting genes called gene regulatory networks (GRNs). By reconstructing GRNs, we can accurately model how cells behave in their natural state and predict how genetic changes will affect them. Transcriptomic data of single cells are now available for a wide range of cellular processes in multiple species. Thus, a method building predictive GRNs from single-cell RNA sequencing (scRNA-seq) data, without any additional prior knowledge, could have a great impact on our understanding of biological processes and the genes playing a key role in them. To this end, we developed IGNITE (Inference of Gene Networks using Inverse kinetic Theory and Experiments), an unsupervised machine learning framework designed to infer directed, weighted, and signed GRNs directly from unperturbed scRNA-seq data. IGNITE uses the GRNs to generate gene expression data upon single and multiple genetic perturbations. IGNITE is based on the inverse problem for a kinetic Ising model, a model from statistical physics that has been successfully applied to biological networks. We tested IGNITE on murine pluripotent stem cells (PSCs) transitioning from the naïve to formative states. Using as input only scRNA-seq data of unperturbed PSCs, IGNITE simulated single and triple gene knockouts. Comparison with experimental data revealed high accuracy, up to 74%, outperforming currently available methods. In sum, IGNITE identifies predictive GRNs from scRNA-seq data without additional prior knowledge and faithfully simulates single and multiple gene perturbations. Applications of IGNITE range from studying cell differentiation to identifying genes specifically active under pathological conditions.
Article activity feed
-
Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
The authors do not wish to provide a response at this time.
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #3
Evidence, reproducibility and clarity
Summary
The manuscript presents IGNITE (Inference of Gene Networks using Inverse kinetic Theory and Experiments), an unsupervised machine learning framework for constructing gene regulatory networks from single-cell RNA sequencing (scRNA-seq) data. IGNITE utilizes a kinetic inverse Ising model to infer gene interactions from binarized expression data and can predict genetic perturbation effects, such as those from knockout experiments. Although the application of inverse Ising models to network reconstruction is not entirely novel, IGNITE's specific implementation and its application to single-cell RNA sequencing data represent a new development. The method is tested on the transition from naive to formative states in murine pluripotent stem cells, a system the authors are highly knowledgeable about, and its performance is compared to state-of-the-art alternative methods.
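To make the modelling idea in this summary concrete, the following is a minimal sketch of a kinetic Ising simulation with a clamped (knocked-out) gene. It is not the authors' implementation: the coupling matrix `J`, the biases `h`, the parallel Glauber update rule, the toy three-gene network, and the function names `simulate` and `mean_activity` are all illustrative assumptions.

```python
import math
import random

def simulate(J, h, steps, knockout=(), seed=0):
    """Parallel Glauber dynamics for binary genes encoded as spins in {-1, +1}.

    J[i][j] is the assumed effect of gene j on gene i; h[i] is a bias term.
    Genes listed in `knockout` are clamped to -1 (inactive) at every step.
    Returns the list of visited states, including the initial one.
    """
    rng = random.Random(seed)
    n = len(h)
    state = [rng.choice((-1, 1)) for _ in range(n)]
    for k in knockout:
        state[k] = -1
    history = [state[:]]
    for _ in range(steps):
        new_state = []
        for i in range(n):
            if i in knockout:
                new_state.append(-1)  # knocked-out gene stays off
                continue
            field = h[i] + sum(J[i][j] * state[j] for j in range(n))
            p_on = 1.0 / (1.0 + math.exp(-2.0 * field))
            new_state.append(1 if rng.random() < p_on else -1)
        state = new_state
        history.append(state[:])
    return history

def mean_activity(history, gene):
    """Average spin of one gene over the simulated trajectory."""
    return sum(s[gene] for s in history) / len(history)

# Hypothetical toy network: gene 0 activates gene 1, gene 1 represses gene 2.
J = [[0.0, 0.0, 0.0],
     [2.0, 0.0, 0.0],
     [0.0, -2.0, 0.0]]
h = [0.5, 0.0, 0.0]

wild_type = simulate(J, h, steps=200)
ko_gene0 = simulate(J, h, steps=200, knockout=(0,))
```

Comparing `mean_activity(wild_type, 1)` with `mean_activity(ko_gene0, 1)` shows how a simulated knockout propagates through the assumed couplings. Inferring `J` and `h` from data (the inverse problem) is a separate step that this sketch does not cover.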
Major concerns
My concern regards the generality of the method, particularly the entire pipeline presented, and the fairness of the performance comparison. These concerns can be easily addressed by the authors by better explaining their choices and their general applicability, and by toning down the conclusions about the comparison with existing inference methods.
The pre-processing steps are extensive, and their rationale is not always clear, though the results heavily depend on this analysis. Several steps appear to involve arbitrary choices optimized for specific outcomes, potentially introducing biases. The authors should better explain the rationale behind their choices to mitigate these concerns.
Specifically, part of the pipeline seems to be built to reproduce a specific expression pattern of 24 genes that some of the authors discovered in a previous paper. Although this prior knowledge could be useful and relevant in this specific system, it could limit the generality of the method. For example, the authors selected approximately 2000 genes based on prior knowledge and used a combination of t-SNE and UMAP for dimensionality reduction (although the two techniques have a similar goal). This specific combination seems to reproduce the pseudotime alignment the authors were expecting to find, but such prior information might not be available in general. Therefore, feature selection and the methods used to project data need more justification, especially if the goal is to create a general tool applicable across different biological systems.
Analogously, the clustering seems manually adjusted to match known expression patterns of 24 relevant genes, rather than being the result of an optimized clustering method. Additionally, the clusters overlap with different time points, raising concerns about potential batch effects. These issues should be addressed to strengthen the validity of the method.
The claims about the comparison with existing methods should be toned down. While the comparisons are useful and interesting, they might be biased due to the method's fine-tuning for the specific system studied. The claim that the model requires only scRNA-seq data is misleading, as strong prior biological knowledge was used to select, for example, the genes analyzed.
Significance
The manuscript is scientifically sound, clearly written, and deserves publication. The proposed method is quantitative, novel, theoretically grounded, and was tested in detail with appropriate null models and statistical methods. Moreover, IGNITE can be applied to various biological systems as the availability of scRNA-seq datasets is continuously growing. The paper will be of interest to a broad community of computational biologists and biology labs interested in gene regulation using scRNA-seq data.
The main limitation, in my opinion, is that the method, and particularly the pre-processing pipeline, is fine-tuned for the specific biological system tested. Testing IGNITE on another biological system without pre-selected pre-processing steps or detailed biological priors would be more convincing and would make the paper's conclusions much stronger. The comparison with other methods may also be slightly biased due to this fine-tuning.
My background is in statistical physics, with expertise in biological physics, specifically in mathematical modeling and data analysis in molecular biology.
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #2
Evidence, reproducibility and clarity
Corridori et al. introduce IGNITE, a computational framework to infer gene regulatory networks (GRNs) from scRNA-seq data leveraging the kinetic Ising model, which can be used to simulate synthetic gene expression and perform in-silico knockout experiments. Other similar frameworks exist, but none combine these three aspects. The authors have generated an scRNA-seq dataset of murine ESC differentiation, which they use to compare their method with others. Specifically, they show that they can infer known regulatory interactions, that they can generate data similar to the original, and that the method can potentially predict gene expression changes in transcription factor knock-out perturbations.
Major comments:
- Many of the authors' claims are backed by qualitative results and are not properly quantified. In Fig2, the authors qualitatively compare gene-gene correlations for the original data and their prediction. Instead of only visualizing them, they should compute and report the Spearman correlation between the original expression and the predicted one. The Fraction of Agreement is not a good metric for comparing knockout predictions, since it depends entirely on the class imbalance of signs. For example, if the selected genes are 75% positive and 25% negative, a naive predictor that only outputs positive predictions will still obtain a high score (75% agreement). Instead, the authors should quantify this with Spearman correlation or RMSE and compare across methods. In FigS4a-b the authors qualitatively claim that other methods could not predict the expected cell composition; they should quantify this and report the values across methods. When comparing against the ground truth network, the fraction of correctly inferred interactions is technically the same as precision but ignores recall. I suggest the authors compute precision, recall and a combined F1 score to compare the evaluated methods. The authors claim that the method is scalable to a larger number of genes, but no data are provided; they should show how their method compares to others in memory usage and running time for different numbers of cells and genes.
- The authors need to better describe which tests were performed when talking about significance, which thresholds and which corrections, if any, were employed.
- To reduce the number of dimensions of the scRNA-seq data, the authors apply t-SNE and then apply UMAP to the t-SNE output to project the data into a lower dimensional space. This is fundamentally wrong, since distances are not well preserved in t-SNE. Instead, the authors should first employ PCA and then UMAP. Additionally, the authors use UMAP distances in the Slingshot pseudotime calculation. As with t-SNE, UMAP distances have no real meaning and should only be used for visualization purposes. Instead, the authors should provide Slingshot with the PCA embeddings.
- Dictys (PMID: 37537351) is a known GRN inference method that can also simulate gene expression, but it is missing from the benchmark. The authors should add it to the method comparison.
- The current manuscript is not reproducible since it is missing the method's code, the code to reproduce the figures and the generated scRNA-seq data.
- Authors claim that the method is scalable to a larger number of genes but no data is provided to back this claim. They should show how their method compares to others when using a different number of cells and number of genes.
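The metric concerns in the first major comment can be illustrated with a short sketch. It shows how a Fraction of Agreement computed on signs rewards a naive always-positive predictor under class imbalance, and how precision, recall and F1 are computed on sets of edges. The function names, the toy sign vectors and the gene labels `g1` to `g4` are hypothetical and are not taken from the manuscript.

```python
def fraction_of_agreement(true_signs, predicted_signs):
    """Share of genes whose predicted sign of change matches the observed one."""
    matches = sum(t == p for t, p in zip(true_signs, predicted_signs))
    return matches / len(true_signs)

def precision_recall_f1(true_edges, predicted_edges):
    """Precision, recall and F1 for predicted edges against a reference network."""
    true_edges, predicted_edges = set(true_edges), set(predicted_edges)
    tp = len(true_edges & predicted_edges)
    precision = tp / len(predicted_edges) if predicted_edges else 0.0
    recall = tp / len(true_edges) if true_edges else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Imbalanced toy example: 75% of genes go up, 25% go down.
true_signs = [1] * 15 + [-1] * 5
always_up = [1] * 20  # naive predictor that ignores the data
naive_score = fraction_of_agreement(true_signs, always_up)

# Hypothetical reference and predicted edge sets (regulator, target).
true_edges = {("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g1")}
predicted_edges = {("g1", "g2"), ("g2", "g3")}
precision, recall, f1 = precision_recall_f1(true_edges, predicted_edges)
```

Here the naive predictor reaches an agreement of 0.75 without using any information, while the edge example has perfect precision but recovers only half of the reference edges, which precision alone would hide.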
Minor points:
- In the introduction, authors mention multimodal GRN inference methods but do not provide any references.
- In Table 1, CellOracle is annotated as not being able to do multiple KO, which is wrong. Additionally, the authors mention that IGNITE uses no prior knowledge, which is not really true since it requires pseudotime ordering. The authors should add a column to Table 1 indicating whether each method requires pseudotime.
- It is unclear what the dashed arrow in Fig1b means. Moreover, plotting gene expression values on top of UMAPs can be misleading; instead, the authors should plot the gene expression distributions binned by pseudotime.
- The authors report a p-value of 1.04×10^-171, which is below the detection limit (see PMID: 30921532). The authors should change it to an interval such as p < 2.2×10^-16.
- To make CellOracle results easier to interpret and more comparable, the authors should run it at the atlas level instead of at the cell type level, thereby generating only one GRN. This can be achieved by assigning the same cluster label to all cells.
- Experimental values in FigS3b seem to have been repeated and do not match the previous ones for IGNITE and SCODE.
- It is unclear what the different circles mean in Fig5b.
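The p-value reporting point above can be sketched as a small formatting helper. The function name `format_p_value` and the choice of double-precision machine epsilon as the reporting floor are assumptions made for illustration; that value is approximately 2.2×10^-16, which matches the bound suggested in the comment.

```python
import sys

def format_p_value(p, floor=sys.float_info.epsilon):
    """Report p-values below `floor` as an upper bound instead of a point value."""
    if p < floor:
        return f"p < {floor:.1e}"
    return f"p = {p:.2e}"

reported = format_p_value(1.04e-171)  # falls below the floor
ordinary = format_p_value(0.0034)     # reported as a point value
```

With the default floor, `reported` becomes `"p < 2.2e-16"`, while larger values keep their point estimate.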
Significance
This manuscript is an incremental and methodological work for specialized audiences. Its strengths are that the authors employ a kinetic Ising model for GRN inference and that they provide a single framework capable of inferring, simulating and perturbing gene expression. The main limitations are that the claims should be better quantified and that the code and data need to be made accessible.
-
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
Learn more at Review Commons
Referee #1
Evidence, reproducibility and clarity
Summary
Corridori and colleagues propose IGNITE, a novel method to recover Gene Regulatory Networks (GRNs) from single cell RNA-sequencing (scRNA-seq) data. Their method solves the inverse Ising problem, generating a cohort of candidate GRNs and optimising them to minimise the difference to the input expression matrix. The authors report that IGNITE is able to predict wild type data and simulate both single and multiple gene knockouts. The authors benchmark this method on an in-house data set of differentiating pluripotent stem cells (PSCs). They focus on a small set of genes known to be involved in PSC differentiation into formative cells. The authors benchmark IGNITE against state of the art tools (SCODE, MaxEnt and CELLORACLE). They evaluate IGNITE's ability to predict wild type gene expression by comparing their data with experimental data and with SCODE. They conclude that the tool has generative capacity comparable with SCODE. They also evaluate IGNITE's ability to recover known interactions with respect to other tools, without finding that it significantly outperforms them.
Major comments
- Are the key conclusions convincing?
The conclusions appear convincing, although model generalizability could be shown in a more thorough manner. For instance, analysing another publicly available dataset could help demonstrate the effects of hyperparameters on GRN predictions and their robustness across different experiments.
- Should the authors qualify some of their claims as preliminary or speculative, or remove them altogether?
Claims are well supported by data.
- Would additional experiments be essential to support the claims of the paper? Request additional experiments only where necessary for the paper as it is, and do not ask authors to open new lines of experimentation.
I think the work would benefit from an additional benchmark on a different cellular system. This experiment would show how hyperparameters generalise across datasets and would give potential users insight into how to tune them.
Also, how does the model scale with the number of genes? A benchmark on computation time and resources required to infer GRN of growing size would be valuable in the adoption of this tool.
In addition, I think the GRN comparison benchmark presented in section (3.4) would benefit from a quantitative discussion. The authors show inferred GRNs in Figure 4 and S5. For instance, measuring matrix similarity (when appropriate) would help in understanding how the predicted GRNs compare. I understand that the authors attempt to do so by focusing on validated interactions and computing the fraction of correctly inferred interactions (FCI), but I think a measurement of the overall similarity (e.g. Pearson correlation) would add to this.
Another comment regards the dependency between Correlation Matrices Distance (CMD) and FCI, shown in Figure 5. I understand that the IGNITE GRNs that maximise FCI are not the same as those that minimise CMD. However, it looks like the GRNs that maximise FCI have higher value in terms of biological information. I wonder whether optimization for one or the other metric could be left to the end user as a tunable parameter.
The authors should discuss why the expression of some genes does not follow the expected trends (Fig 1C vs Fig S1A). Out of the 24 genes they select for their analysis, at least four do not follow the expected trends. Sox2, according to the literature, is a Naive gene; however, in Figure 1C its expression pattern is more similar to that of Formative late genes. Other genes with similar "unexpected" patterns are Zic3, Etv4 and Sall4.
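As a minimal sketch of the overall-similarity measurement suggested here, two inferred interaction matrices can be compared by the Pearson correlation of their off-diagonal entries. The function names, the toy matrices, and the decision to exclude the diagonal (self-interactions) are assumptions made for illustration, not choices taken from the manuscript.

```python
import math

def off_diagonal(matrix):
    """Flatten a square matrix, skipping the diagonal (self-interactions)."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(n) if i != j]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def grn_similarity(grn_a, grn_b):
    """Overall similarity of two GRNs defined on the same ordered gene set."""
    return pearson(off_diagonal(grn_a), off_diagonal(grn_b))

# Hypothetical 3-gene interaction matrices from two inference methods.
grn_a = [[0.0, 1.0, -0.5],
         [0.2, 0.0, 0.8],
         [-1.0, 0.3, 0.0]]
grn_b = [[2 * w for w in row] for row in grn_a]  # same structure, rescaled
grn_c = [[-w for w in row] for row in grn_a]     # all signs flipped

same_structure = grn_similarity(grn_a, grn_b)
opposite_signs = grn_similarity(grn_a, grn_c)
```

Because Pearson correlation is scale-invariant, methods that report weights on different scales can still be compared, which is why `grn_a` and `grn_b` score close to 1 while `grn_c` scores close to -1. A rank-based alternative such as Spearman correlation would follow the same pattern.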
Are the suggested experiments realistic in terms of time and resources? It would help if you could add an estimated cost and time investment for substantial experiments.
I think the suggested experiments are doable as long as the authors use publicly available data, i.e. the in-house dataset they generated for this study is enough to show applicability. For example, the datasets analysed in the SCODE paper (https://doi.org/10.1093/bioinformatics/btx194) could be used as a second benchmark. The point of applying the tool to another dataset is to show how it generalises across different biological systems, experiments and, potentially, sequencing technologies.
- Are the data and the methods presented in such a way that they can be reproduced?
The methods section is really clear. To enable reproducibility, the raw scRNA-seq data, the IGNITE source code and the code written to benchmark it should be released in the public domain in appropriate repositories (e.g. ENA, GitHub, Binder).
- Are the experiments adequately replicated and statistical analysis adequate?
Yes.
Minor comments
- Specific experimental issues that are easily addressable.
Related to the Sox2 expression pattern is the binarization shown in Figure 2D. How is it possible that Sox2 is always marked as active? Could the authors clarify how these outlier behaviours emerge and propose mitigation strategies, if any?
In section 5.11.2 it is unclear whether the x_i are in log scale or not. Since the model starts from binarized, log-transformed expression values, should the generated values not be on the same scale as the input?
- Are prior studies referenced appropriately?
Yes, referencing is clear.
- Are the text and figures clear and accurate?
Yes, figures appear to be clear, readable and well documented both in captions and main text.
- Do you have suggestions that would help the authors improve the presentation of their data and conclusions?
Section 3.3 could be improved by better describing the experimental datasets. Only in the methods section is it clearly stated that the experimental data for single KO experiments were retrieved from the literature.
Check typesetting:
- parenthesis missing in Eq. 1
- Leftover $ in section 3.1
- Parenthesis missing in Section 3.3
- Misplaced comma in section 5.2.1
Significance
- Describe the nature and significance of the advance (e.g. conceptual, technical, clinical) for the field.
The paper presents a method to infer GRNs from scRNA-seq data alone. Applications include GRN prediction and the simulation of perturbations. This paper represents a technical advance in the field, as it is the first application of the inverse Ising problem to GRN inference.
- Place the work in the context of the existing literature (provide references, where appropriate).
The paper itself presents the landscape of GRN inference tools using scRNA-seq data: SCODE, MaxEnt and CELLORACLE. More tools exist; for instance, SCENIC (https://doi.org/10.1038/nmeth.4463) mainly relies on co-expression matrices. Other tools exist but require additional data types, e.g. GRaNIE and GRaNPA (https://doi.org/10.15252/msb.202311627) leverage physical interaction data (ATAC-seq, ChIP-seq). Similarly, DeepFlyBrain uses deep neural networks to infer eGRNs in Drosophila (https://doi.org/10.1038/s41586-021-04262-z). The value of tools like IGNITE and its competitors is that they do not require additional data types, which, in turn, helps in controlling experimental costs.
- State what audience might be interested in and influenced by the reported findings.
The paper might be of interest to biologists interested in regulation of gene expression. The tool might turn out to be useful in planning experimental work by guiding the choice of perturbations to introduce in experimental systems.
- Define your field of expertise with a few keywords to help the authors contextualize your point of view. Indicate if there are any parts of the paper that you do not have sufficient expertise to evaluate.
I am a computational biologist.
I do not have sufficient expertise to evaluate the mathematical details of the method.
-