Applying causal discovery to single-cell analyses using CausalCell

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    The paper describes an online tool, CausalCell, intended for the analysis of causal links in single-cell datasets. Regarding its significance, this work is timely and important, with potentially broad applications as a generally useful tool. However, there are major concerns about the suitability of the tool for its intended purpose, and the extent of validation in the current manuscript is incomplete.

Abstract

Correlation between objects is prone to occur coincidentally, and exploring correlation or association in most situations does not answer scientific questions rich in causality. Causal discovery (also called causal inference) infers causal interactions between objects from observational data. Reported causal discovery methods and single-cell datasets make applying causal discovery to single cells a promising direction. However, evaluating and choosing causal discovery methods and developing and performing proper workflows remain challenges. We report the workflow and platform CausalCell (http://www.gaemons.net/causalcell/causalDiscovery/) for performing single-cell causal discovery. The workflow/platform is developed upon benchmarking four kinds of causal discovery methods and is examined by analyzing multiple single-cell RNA-sequencing (scRNA-seq) datasets. Our results suggest that different situations need different methods and that the constraint-based PC algorithm with kernel-based conditional independence tests works best in most situations. Related issues are discussed and tips for best practices are given. Inferred causal interactions in single cells provide valuable clues for investigating molecular interactions and gene regulations, identifying critical diagnostic and therapeutic targets, and designing experimental and clinical interventions.

Article activity feed

  1. Author Response

    Reviewer #2 (Public Review):

    Wen et al. developed a useful tool for causal network inference based on scRNA-seq data. The authors comprehensively benchmarked 9 feature selection and 9 causal discovery algorithms using both synthetic data and real scRNA-seq data. Their conclusions regarding the performance of these algorithms on synthetic data are solid and valuable. I believe this tool or platform has the potential to help biologists discover novel cell type-specific signaling pathways or gene regulatory events, since it requires no prior knowledge (such as known pathway annotations) as input. However, several major concerns below need to be addressed to improve the paper.

    (1) Current validation of the inferred causal networks using real scRNA-seq datasets seems quite simple and is not sufficient to support the accuracy and reliability of results. Annotations from the STRING database do not contain directions of edges among genes or proteins. However, the edge direction in the inferred network is a crucial aspect to explain the causal relationships. Besides using "spike-in" data, a systematic validation of the inferred network, especially the edge directions, should be provided.

    We have used the data of the five lung cancer cell lines and alveolar cells and the genes in several pathways (in which causal interactions are better annotated) in the KEGG and WikiPathway databases to validate network inference systematically. Please see the responses to the Essential Revisions (for the authors).

    (2) In order to illustrate the novel discovery, CausalCell should be further compared to existing gene network construction methods based on scRNA-seq data such as SCENIC (Aibar et al. Nature Methods, 2017).

    (a) We have added a "TF=No/Yes" option to feature selection. If this option is ignored, feature selection is as before. If "TF=Yes" is selected, all feature genes are TFs. If "TF=No" is selected, all feature genes are non-TFs. With this option, two rounds of feature selection are normally performed. The first round ("TF=Yes" is selected) selects TFs as feature genes of a response variable (RV), and the second round ("TF=No/Yes" is ignored) selects feature genes as before (feature genes contain both TFs and non-TFs). The user selects genes from the results of the two rounds to build the input to causal discovery.

    (b) The networks inferred by SCENIC are TF-centered: SCENIC searches for genes co-expressed with a TF (through GENIE3/GRNBoost), each TF and its potential target genes form a regulon, and the union of all or some of the regulons forms a network. Thus, SCENIC helps uncover the "one TF->all targets" relationships. In contrast, the "TF=No/Yes" option provides a target-centered transcription regulation analysis and helps uncover the "all TF->one target" relationships (the target is the response variable). Thus, the two approaches are complementary. Feature selection based on the "TF=No/Yes" option also differs from SCENIC in that no predefined regulons (defined upon "cisTarget" databases) are needed.

    (c) We used SCENIC in our initial analysis of the young and old mouse CD4 T cells (see Figure 5 in Elyahu et al. 2019). In the re-analysis of tumor-infiltrating exhausted CD8 T cells, we find that the "TF=No/Yes" option helps uncover transcription regulation. For example, the transcription factor TOX is reported to regulate PDCD1 critically in mice. When we perform feature selection to identify feature genes of PDCD1, TOX is in the top 50 feature genes in the colorectal cancer dataset but not in the lung and liver cancer datasets (Supplementary file 1: Table 1). To re-examine whether TOX critically regulates PDCD1 in the latter two datasets, we perform feature selection with "TF=Yes", and the results show that TOX is a top TF of PDCD1.
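
    The target-centered, two-round selection described in (a)-(c) can be illustrated with a minimal sketch. It is not CausalCell's implementation: ranking by absolute Pearson correlation is only a stand-in for whichever feature selection algorithm is chosen, and the names `select_features`, `tf_set`, and `top_n` are hypothetical.

```python
import math


def abs_pearson(x, y):
    """Absolute Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov / math.sqrt(vx * vy))


def select_features(expr, response, tf_set, tf=None, top_n=50):
    """Rank candidate feature genes of one response variable.

    expr     : dict mapping gene name -> expression values across cells
    response : name of the response variable (the target gene)
    tf_set   : set of gene names annotated as transcription factors
    tf       : True  -> candidates restricted to TFs ("TF=Yes")
               False -> candidates restricted to non-TFs ("TF=No")
               None  -> option ignored, all genes are candidates
    """
    candidates = [g for g in expr if g != response]
    if tf is True:
        candidates = [g for g in candidates if g in tf_set]
    elif tf is False:
        candidates = [g for g in candidates if g not in tf_set]
    target = expr[response]
    # Stand-in scoring (assumption): rank by absolute correlation with the target.
    ranked = sorted(candidates,
                    key=lambda g: abs_pearson(expr[g], target),
                    reverse=True)
    return ranked[:top_n]
```

    In this sketch, a first call with `tf=True` yields the candidate TFs of the response variable and a second call with `tf=None` yields feature genes as before; the user would then choose genes from both lists as the input to causal discovery.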

    (3) The authors should also state what type of network the inferred causal network represents from the biological perspective (e.g. signaling networks or gene regulatory networks?).

    (a) Although methods have been developed specifically for inferring signaling and regulatory networks, whether a network is a signaling network or a gene regulatory network depends on the input data. Also, many proteins and noncoding RNAs function as complexes instead of individually in both kinds of networks, and RNA-seq and scRNA-seq data contain only transcripts. Thus, researchers must infer signaling and gene regulation in cells based on the inferred networks.

    (b) The input to causal discovery can be (i) a target gene and its potential TFs, (ii) a TF and its potential targets, or (iii) genes encoding both TFs and non-TFs. Thus, whether an inferred network is a signaling or a gene regulatory network depends on the input. We have made this clear in the Discussion.

    (4) Besides edge direction, an important feature of CausalCell is the determination of edge sign (i.e. activation or inhibition). The authors should describe its related procedures.

    In the revised section "2.5 Causal discovery", we wrote, "In all inferred causal networks, edges have a sign that indicates activation or repression and have a thickness that indicates CI test's statistical significance. The sign of the edge from A to B is determined by computing a Pearson correlation coefficient between A and B, which is ‘repression’ if the coefficient is negative or ‘activation’ if the coefficient is positive. In most cases, ‘A activating B’ and ‘A repressing B’ correspond to up-regulated A in the case dataset compared with down-regulated B in the control dataset."
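
    The sign rule quoted above can be written as a short sketch, offered for illustration only and not taken from the CausalCell code. The function names `pearson` and `edge_sign` are hypothetical, and the handling of a coefficient of exactly zero is an assumption, since the quoted text does not cover that case.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length expression vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / math.sqrt(vx * vy)


def edge_sign(expr_a, expr_b):
    """Label the edge A -> B from the sign of the correlation between A and B."""
    r = pearson(expr_a, expr_b)
    if r > 0:
        return "activation"
    if r < 0:
        return "repression"
    # Assumption: r == 0 is not addressed in the quoted description.
    return "undetermined"
```

    Because the Pearson coefficient is symmetric in A and B, this rule labels an edge only after its direction has been fixed by causal discovery; it does not itself provide evidence for the direction.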

    (5) The authors did not provide an example of constructing a causal network between cells or cell types, although they mentioned its importance in the Abstract. Such intercellular network examples can distinguish the utility of CausalCell in single-cell data analysis from bulk data analysis.

    Constructing causal networks between cells is quite a different task. We have deleted this sentence from the manuscript because we are still working on it.

    (6) If the control dataset is available, it is currently not clear whether batch effects of the query and control datasets will be removed in the data preprocessing step. Differentially expressed genes cannot be selected correctly if batch effects exist.

    Please see our responses to the second point of Essential Revisions.

  2. eLife assessment

    The paper describes an online tool, CausalCell, intended for the analysis of causal links in single-cell datasets. Regarding its significance, this work is timely and important, with potentially broad applications as a generally useful tool. However, there are major concerns about the suitability of the tool for its intended purpose, and the extent of validation in the current manuscript is incomplete.

  3. Reviewer #1 (Public Review):

    The authors introduce an online tool, CausalCell, to explore causal links in single-cell datasets. The authors investigate the process through examples based on existing data, offer comparisons of different algorithms, and suggest tips about the requirements and limitations of this approach. In my opinion, the main shortcoming is that the authors do not adequately justify whether the methods included in their tool are the most suitable methods for their intended analyses. The lack of a definite "ground truth" or "gold standard" also gets in the way of clearly deciding which algorithms perform the best, especially when there are considerable differences between the results of different algorithms.

  4. Reviewer #2 (Public Review):

    Wen et al. developed a useful tool for causal network inference based on scRNA-seq data. The authors comprehensively benchmarked 9 feature selection and 9 causal discovery algorithms using both synthetic data and real scRNA-seq data. Their conclusions regarding the performance of these algorithms on synthetic data are solid and valuable. I believe this tool or platform has the potential to help biologists discover novel cell type-specific signaling pathways or gene regulatory events, since it requires no prior knowledge (such as known pathway annotations) as input. However, several major concerns below need to be addressed to improve the paper.

    (1) Current validation of the inferred causal networks using real scRNA-seq datasets seems quite simple and is not sufficient to support the accuracy and reliability of results. Annotations from the STRING database do not contain directions of edges among genes or proteins. However, the edge direction in the inferred network is a crucial aspect to explain the causal relationships. Besides using "spike-in" data, a systematic validation of the inferred network, especially the edge directions, should be provided.

    (2) In order to illustrate the novel discovery, CausalCell should be further compared to existing gene network construction methods based on scRNA-seq data such as SCENIC (Aibar et al. Nature Methods, 2017).

    (3) The authors should also state what type of network the inferred causal network represents from the biological perspective (e.g. signaling networks or gene regulatory networks?).

    (4) Besides edge direction, an important feature of CausalCell is the determination of edge sign (i.e. activation or inhibition). The authors should describe its related procedures.

    (5) The authors did not provide an example of constructing a causal network between cells or cell types, although they mentioned its importance in the Abstract. Such intercellular network examples can distinguish the utility of CausalCell in single-cell data analysis from bulk data analysis.

    (6) If the control dataset is available, it is currently not clear whether batch effects of the query and control datasets will be removed in the data pre-processing step. Differentially expressed genes cannot be selected correctly if batch effects exist.