Transiently increased intercommunity regulation characterizes concerted cell phenotypic transition

Abstract

Phenotype transition takes place in many biological processes such as differentiation and reprogramming. A fundamental question is how cells coordinate the switching of expression of clusters of genes. By analyzing single-cell RNA sequencing data in the framework of transition path theory, we studied how such genome-wide switching of expression programs proceeds in five different cell transition processes. For each process we reconstructed a reaction coordinate describing the transition progression, and inferred the gene regulation network (GRN) along the reaction coordinate. In all processes we observed a common pattern: the overall effective number and strength of regulation between different communities first increase and then decrease. This change is accompanied by similar changes in GRN frustration, defined as the overall conflict between the regulation received by genes and their expression states, and in GRN heterogeneity. While studies suggest that biological networks are modularized to contain perturbation effects locally, our analyses reveal a general principle that during a cell phenotypic transition, intercommunity interactions increase to concertedly coordinate global gene expression reprogramming, and canalize to a specific cell phenotype, as Waddington envisioned.

Article activity feed

  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.


    Reply to the reviewers

    Reviewer #1 (Evidence, reproducibility and clarity (Required)):

    In this manuscript by Wang and colleagues, the authors analyse single-cell RNA-seq (scRNAseq) data by applying transition path theory to infer gene regulatory network (GRN) changes along the transition (reaction coordinate, trajectory) between free energy stable states (i.e. cell types). The work aims to understand how stable cell types, and their regulatory programs (combinations of active and repressed genes), switch during differentiation/reprogramming/response (i.e. cell phenotypic transition/CPT). The premise of the work is to assess whether genes within GRNs undergo step-wise repression, state-change and activation (and vice versa; analogous to SN1) or concurrently regulate gene expression (analogous to SN2). The GRNs are inferred based on highly variable genes and their expression dynamics from RNA velocity over CPT, across 3 scRNA-seq datasets.

    The authors first analyse a public scRNA-seq dataset of 3003 human A549 adenocarcinomic basal epithelial cells treated with TGF-β for 0 hrs, 8 hrs, 1 day and 3 days (4 timepoints). The authors select two stable states (Day 0-untreated; Epithelial and Day 3-treatment; Mesenchymal) using local kernel densities and set transition paths using Dijkstra shortest path, dividing state space into Voronoi cells (i.e. reaction coordinate value), and construct single-cell GRNs based on RNA velocity differences (n=500 genes) and a linear model (from Qiu et al). This GRN is based on expression and velocity estimates, and does not distinguish direct from indirect regulation. Calculating interaction frequency (edges) across two stable states over 4 Louvain clusters, the authors find a global increase in effective edges that correlates with increased active genes, but with a variable trend within inter-cluster edges. To quantify the concerted GRN changes between clusters, the authors utilise a "frustration" score (from Tripathi et al 2020). The average frustration score increases and peaks at day 1 treatment, followed by a decline over the terminal stable state (day 3-treatment), similar to the interaction frequency trends. The authors also separately measure network heterogeneity and repeat the analysis using an alternative transition matrix. The authors conclude that EMT proceeds through concerted regulation of multiple genes, first with an increase in inter-cluster edges, frustration and heterogeneity, followed by a decrease into the final stable state. The authors apply the analysis to scRNA-seq data from (i) pancreatic endocrine differentiation where Ngn3-low progenitors give rise to Ngn3-high, then Fev-high and into glucagon producing α-endocrine cells; (ii) dentate gyrus; radial glial cell differentiation into nIPCs, neuroblast, immature granule and mature granule cells. In both cases, the authors observe concerted regulation with an initial increase in inter-community edges and heterogeneity during differentiation, followed by a decrease towards the final stable state.

    The study and ideas in the manuscript are interesting and the methods would potentially be useful. However, there are a few specific and general comments stated below, which the authors should try to address.

    1 • P4: "RC increases first and reaches a peak when cells were treated with TGF-β for about one day, then decreases (Fig. 1G)". It would be better to label the figure with the treatment information.

    Reply: Thanks for your advice. In the revised manuscript, we analyzed two additional datasets and moved the EMT results to supplementary Fig. EV8. In the new Fig. 1d, we marked the cell types along the reaction coordinate.

    2 • Fig. 1G and EV1D: Why are the trends different?

    Reply: In the original figures, Fig. 1g shows the frustration score and EV1D shows the variation of the pseudo-Hamiltonian along the reaction coordinate. The frustration score is the focus of this work. We also calculated the pseudo-Hamiltonian since it has been used in the literature. However, we realized that showing both results might lead to confusion, so we deleted all pseudo-Hamiltonian results in the revised manuscript.

    3 • How is the appropriate community/cluster/Louvain resolution selected? This can have a major impact on number of cell states, types and transition path from initial to final state.

    Reply: The number of cell states, the cell types and the transition path from the initial to the final state are not determined from the community/cluster/Louvain analyses. For the EMT data, we assume that most cells at the initial treatment time point are epithelial cells, and those at the final time point are mesenchymal cells. For the other datasets, we followed the original publications to assign cell types based on known marker expression.

    The Louvain method was applied to coarse-grain the gene regulation network, and it does not affect the number of cell states, the cell types or the transition path, which were determined separately. To address the reviewer's question, we also used the Leiden method to adjust the resolution (1). We tried three different resolution values: 0.8, 1.0 and 1.2. The resolution does not affect the result: the number of inter-community edges consistently first increases and then decreases. The results are added to Fig. EV12.

    Figure EV12 Cell-specific variation of the number of effective inter-community edges calculated with different resolution parameter values for dentate gyrus neurogenesis (a), pancreatic endocrinogenesis (b), and bone marrow hematopoiesis (c). Each dot represents a cell and the color represents the number of inter-community edges.
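As an illustration of the quantity discussed above, the following minimal sketch counts inter-community edges under two different community assignments. It is not the code used for the manuscript: the edge list, gene names and community labels are hypothetical toy inputs, and the partitions are written by hand rather than produced by Louvain or Leiden at a given resolution.

```python
def count_inter_community_edges(edges, community):
    """Count edges whose two endpoint genes lie in different communities.

    edges: iterable of (regulator, target) gene pairs (toy input).
    community: dict mapping gene -> community label (toy input).
    """
    return sum(1 for src, dst in edges if community[src] != community[dst])


# Hypothetical effective edges of one cell's GRN.
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g1", "g4")]

# Two hand-written partitions standing in for a coarser and a finer resolution.
coarse = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
fine = {"g1": 0, "g2": 1, "g3": 2, "g4": 2}

n_coarse = count_inter_community_edges(edges, coarse)
n_fine = count_inter_community_edges(edges, fine)
```

The absolute count generally changes with the partition; the claim in the reply concerns the trend of this count along the reaction coordinate, which would be checked by repeating the count for every cell at each resolution.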

    What effect does the Louvain resolution have on e.g. frustration scores?

    Reply: The resolution of the community detection algorithm does not affect the frustration scores, since the frustration score is based on the gene-gene interactions rather than on the community assignment.

    The authors match resolution to samples/timepoints/known prior cell types i.e. 3-4 communities. However it is unclear whether this is enough to describe entire differentiation/transition process.

    Reply: This is a good question. In a reply above we explained how the cell types were determined. We also agree with the reviewer that these coarse-grained communities cannot reflect the overall heterogeneity and dynamics of the whole process. Note that in most of our analyses (e.g., reaction coordinate and transition paths) we treated the transition as continuous, and the distribution of single-cell data points in all datasets covers the whole space involved in the cell phenotypic transition. The coarse-grained analyses are for further mechanistic insights into how gene regulatory networks are reorganized during the transition process.

    Gene selection: The selection based on minimum 20 counts as highly expressed genes is arbitrary and dependent on sequencing depth. Perhaps the authors could show distribution of gene counts for the datasets and have a data-driven filtering criteria

    Reply: Thanks for the advice. The number 20 is a default value suggested in the package we use (scVelo); in another package, dynamo, the default is 30. Following the reviewer's suggestion (together with the next question on the influence of all highly variable genes), we looked for a data-driven filtering criterion. The method has been described in different tools (2-4). We first grouped the genes into 20 bins by their mean expression values, and scaled their dispersions by subtracting the mean of the dispersions and dividing by the standard deviation of the dispersions. Figure EV9 shows the distribution of the minimum shared counts. As one can see, the counts of most genes are larger than 10, and using a smaller value causes errors in the subsequent velocity analysis. Therefore we set the minimum shared counts to 10 in the new results.

    Figure EV9 Shared counts distribution of the datasets. (a) Dentate gyrus neurogenesis; (b) Pancreatic endocrinogenesis; (c) Bone marrow hematopoiesis.
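The binning and scaling step described in the reply can be sketched as follows. This is an illustrative pure-Python version, not the implementation in scVelo or dynamo. It assumes that the scaling is done within each bin, that bins are equal-width over the range of mean expression, and that bins with fewer than two genes get a score of 0; the function name and these choices are assumptions made for the example.

```python
import statistics


def normalize_dispersions(means, dispersions, n_bins=20):
    """Z-score each gene's dispersion within its mean-expression bin.

    Assumptions of this sketch: bins are equal-width over the range of the
    means, and a bin with fewer than two genes (or zero spread) gets score 0.
    """
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 if all means are equal
    # Bin index per gene; the largest mean is clamped into the last bin.
    bins = [min(int((m - lo) / width), n_bins - 1) for m in means]

    scaled = [0.0] * len(means)
    for b in set(bins):
        idx = [i for i, bi in enumerate(bins) if bi == b]
        if len(idx) < 2:
            continue
        values = [dispersions[i] for i in idx]
        mu = statistics.mean(values)
        sd = statistics.stdev(values)
        if sd == 0:
            continue
        for i in idx:
            scaled[i] = (dispersions[i] - mu) / sd
    return scaled


# Toy input: three lowly and three highly expressed genes, two bins.
toy_means = [1.0, 1.1, 1.2, 9.0, 9.5, 10.0]
toy_dispersions = [1.0, 2.0, 3.0, 10.0, 20.0, 30.0]
toy_scaled = normalize_dispersions(toy_means, toy_dispersions, n_bins=2)
```

Scaling within bins makes dispersions comparable across genes with very different mean expression, so a fixed cutoff on the scaled value selects variable genes at all expression levels.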

    The choice of 500 variable genes (for human A549 cells) is also quite arbitrary. Perhaps, the authors could compare how additional genes (all highly variable genes) affect their analysis and interpretation.

    Reply: Thanks. Following the previous question on shared counts and data-driven filtering criteria, we now take all the highly variable genes into consideration. The details of gene selection and binarization are given in the Materials and Methods section (Materials and Methods 2).

    How do other factors (sequencing depth, genes detected, number of cell types, multiple branches) affect the connectivity between communities at different phases of transition/development?

    Reply: This is a good question. The A549 EMT dataset has a sequencing depth of 40,000-50,000. The dentate gyrus neurogenesis dataset has a sequencing depth of 56,700 reads. A saturation depth would be close to 1,000,000, but there is a compromise between cell number and depth. Some genes are not detected even at saturating depth, which is why preprocessing is needed. On the other hand, the network we inferred includes both direct and indirect interactions, so the influence of sequencing depth and of the number of detected genes is reduced to a certain extent. We also used a random subset of the selected genes and performed the same analyses; the results are consistent with those obtained using all the genes (Fig. EV11b). With the new gene selection criteria (Materials and Methods 2), our analyses do not depend on the number of cell types.

    We also analyzed another branch, the beta branch, of the pancreatic endocrinogenesis data, and it shows the same results (Fig. EV4). There are two additional branches in the pancreatic endocrinogenesis dataset. It has been reported that the RNA velocity estimation for the epsilon branch is incorrect (3), and there are too few cells in the delta branch for reliable analyses. Therefore we did not present results for these two branches.

    Figure EV4 Analyses on the branch of insulin-producing β-cells in pancreatic endocrinogenesis.

    (a) Transition graph based on RNA velocity.

    (b) The RCs and corresponding Voronoi cells. The large colored dots represent the RC points (starting from blue and ending in red). The small dots represent cells, colored by cell type.

    (c) Frustration score along the RCs.

    (d) Cell-specific variation of effective intercommunity regulation. Each dot represents a cell. Color represents the number of effective intercommunity edges within each cell in the GRN.

    • Are the velocity graph, transition matrix and further shortest path estimation derived in a reduced latent space, and if so, how many PCs (nPCs) are used and what impact does this have? Presumably, the density estimation is not performed in expression space.

    Reply: Yes. The calculation of the transition matrix is based on neighbor information, and the neighbors were computed in the reduced latent space in scVelo and dynamo. We performed the same analysis while varying the number of principal components. The results are similar because the first several components account for a large proportion of the variance. Figure R1 shows the results for dentate gyrus neurogenesis with 10, 20 and 30 principal components, respectively. In the revised manuscript, we deleted the density estimation constraint to simplify the procedure.

    Figure R1 Frustration score along RCs (left) and cell-specific variation of the number of effective intercommunity edges in the GRN within each cell (right; each dot represents a cell and color represents the number of effective intercommunity edges) when using different numbers of PCs in the analyses (dentate gyrus neurogenesis): (a) number of PCs is 10; (b) number of PCs is 20; (c) number of PCs is 30.

    • The figure legends and labels were hard to read. These should be improved for better readability.

    Reply: Thanks. We modified the figure legends and labels.

    • A suggestion would be to move the initial results section to methods and highlight the biological interpretation.

    Reply: Thanks for your advice. We moved a large part of this section to the Materials and Methods.

    The authors could highlight which GRN and representative genes/edge pairs are highest ranked within inter-community and to overall final stable states.

    Reply: Thanks. We list some representative gene pairs for the different datasets in Tables EV2, EV3 and EV4, and we performed gene enrichment analysis for each community.

    • How does the GRN inference compare to current state-of-the-art GRN inference scRNA-seq methods?

    Reply: We used the GRISLI method to perform the same analysis (5). The results are similar to those obtained with our current method (Figure EV6). We want to emphasize that the focus of this work is not on another GRN inference method, but on general principles of GRN reorganization during a cell phenotypic transition process.

    Figure EV6 Analyses of datasets of dentate gyrus neurogenesis (a), pancreatic endocrinogenesis (b), and hematopoiesis (c) based on GRN inferred with GRISLI.

    (a) Frustration score along the RCs of dentate gyrus neurogenesis (left) and cell-specific variation of the number of inter-community edges (right). Each dot represents a cell and color represents the number of inter-community edges in GRN within each cell.

    (b) Same as in panel (a), except for pancreatic endocrinogenesis.

    (c) Same as in panel (a), except for hematopoiesis.

    • How do extremely noisy/stochastic genes vary in metrics between final stable states? How are the metrics affected by number of cells and stochasticity of expression within a given cluster/community?

    Reply: To address this question, we selected two genes with high variance, Id2 and Cdkn1c, and compared their distributions in the initial and final states. The gene distributions show a significant shift between the Ngn3 low EP cells and Alpha cells (Fig. R2a and b, left). Then we randomly selected a subset (half) of the cells and compared the distributions of these high-variance genes in the sub-population (Fig. R2a and b, right). The results are similar to the full-set results.

    Fig. R2 Comparison of gene distribution in the initial and final states in pancreatic endocrinogenesis. (a) Comparison of the distribution of gene Id2 at the initial and final states (left), and in the randomly selected sub-population at the initial and final states (right). (b) Comparison of the distribution of Cdkn1c at the initial and final states (left), and in the randomly selected sub-population at the initial and final states (right).

    • Given that the authors' approach includes both direct and indirect gene effects, the authors could further prune genes based on existing TF databases or protein-protein validated networks.

    Reply: This is a good suggestion. We will work on this idea in future work. As we mentioned, due to constraints of data quality, only tens of transcription factors can be analyzed in these datasets. We list some regulations of transcription factors inferred with the current method in Table EV1.

    • It is unclear which GRNs are already known and which ones are novel and biologically relevant

    Reply: We compared some of the regulations inferred with our method against the literature in Table EV1.

    • It would be good for authors to comment when there are multiple bifurcations instead of A-B transitions. Particularly in datasets with multiple discrete stable states.

    Reply: This is a good question. In our analysis, we focus on the transition from one stable state to another. For a transition process with multiple bifurcations, such as pancreatic endocrinogenesis, the results are similar across different branches. For a transition that goes through multiple discrete stable states, for example A → B → C, we expect to observe two peaks in the frustration score and in the number of inter-community edges. We added some discussion in the Discussion section.

    • Another suggestion would be to highlight gene expression of selected markers based on f-regression and mi over the trajectory

    Reply: As we modified the criteria of gene selection, we plotted trajectories of some high-variance genes versus the reaction coordinate for the different datasets in Fig. EV10, based on the current criteria.

    Figure EV10 Typical trajectories of high-variance genes versus RCs of dentate gyrus neurogenesis (a), pancreatic endocrinogenesis (b) and bone marrow hematopoiesis (c).

    • If possible, a proof of principle could be re-analysis of a perturbation scRNA-seq dataset (e.g. where one path/transition path is stalled)

    Reply: Thanks. This is a really good suggestion. We will perform more systematic studies in future work.

    Reviewer #1 (Significance (Required)): Nature and significance of advance: The study and ideas in the manuscript are interesting and the methods would potentially be useful to the community. Compare to existing published knowledge:

    Audience: Predominantly computational audience

    Your Expertise: PI with background in experimental, computational biology and expertise in single-cell genomic tools and developmental biology

    Reviewer #2 (Evidence, reproducibility and clarity (Required)):

    Understanding the cellular and molecular basis of cell type or cell state transitions occurring during development or reprogramming is a fundamental challenge. scRNA-seq has provided a window into gene expression programs across thousands of cells undergoing such transitions. Wang and colleagues leverage scRNA-seq and develop an approach to reverse engineer gene regulatory network underlying cells along a path from one cell type/state to another, and characterize community-level properties of this network associated with various stages of the cell phenotype transition. The study is innovative and rigorous, and their results point to how intercommunity interactions increase and then decrease, indicating a concerted regulatory rewiring that orchestrates transitions. Application of their approach to three different datasets also shows that this trend is consistent across three different transitions and maybe a general trend. However, there are some major and minor concerns that need to be addressed.

    **Major comments and questions**

    1. The analogy to SN1 and SN2 mechanisms of chemical bond formation is very nice.
    2. What is the basis for the two statements made in paragraph 3 of Introduction (beginning with "A question arises ...") about transitions being sequential or concurrent? Please

    Reply: Thanks. We added references in this paragraph.

    2.1. Provide references to previous experimental and computational studies that have investigated developmental and reprogramming gene expression programs.

    Reply: Thanks. We added a paragraph in the Introduction.

    2.2. Describe specific examples of findings that support the two possible transitions highlighted here. Why couldn't transitions happen through an entirely gradual process involving changes to overlapping subsets of genes?

    Reply: Thanks. In their review paper, Naomi Moris et al. proposed the hypothesis that cell phenotype transition is similar to a chemical reaction (6). We extrapolate this hypothesis and test it in our study. As an example of the SN1 mechanism, Kalkan et al. showed that mouse embryonic stem cells can exit from naïve pluripotency but remain uncommitted (7).

    Just as the SN1 and SN2 mechanisms are two extremes in chemical reactions and there are cases that lie in between, we agree with the reviewer that such gradual processes may exist for cell phenotypic transitions. Indeed, the result in Fig. EV4d shows that the frustration score remains flat for the Fev+ → Beta transition, suggesting a possible gradual process. With the analyses provided in this work, such as the reaction coordinate, frustration score, heterogeneity, and inter-/intra-community edges, one may perform more systematic studies on a larger number of datasets and enumerate/classify possible patterns of transitions.

    • Please make plots of the number of effective intra-community edges vs. number of active genes to support the statement that these two numbers are correlated.

    Reply: We plotted the number of intra-community active genes and calculated its correlation coefficient with the number of effective intra-community edges in dentate gyrus neurogenesis (Fig. EV1d). The correlation coefficients are 0.91, 0.96, 0.99 and 0.96 for communities 0, 1, 2 and 3, respectively.
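For reference, a correlation coefficient of this kind can be computed as in the sketch below. The reply does not state which coefficient was used; Pearson's r is assumed here, and the per-cell counts are made-up toy numbers, not values from the manuscript.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-cell counts for one community (not data from the manuscript).
active_genes = [10, 12, 15, 20, 24]
effective_edges = [31, 35, 47, 58, 73]

r = pearson_r(active_genes, effective_edges)
```

In practice the two sequences would hold, for each cell, the number of active genes in a community and the number of effective intra-community edges in that cell's GRN.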

    A bunch of notations are not clear:

    4.1. What is the "r" in "strongest intercommunity interactions at r = 10 (Fig. 1F)"? Is it the same as the "r" mentioned in the Methods section?

    Reply: r is the index of the discretized reaction coordinate. We added this when defining the reaction coordinate, and we fixed the conflicting usage of r in Materials and Methods 4.

    4.2. What is "s_i" in "cell-specific effective matrix, Fbar_ij = (2*s_i - 1)*F_ij"? Also, that description of F_ij, f_ij, and H should be moved to the Methods section, and a more high-level, intuitive description should instead be included in this Results paragraph.

    Reply: s_i represents the binarized gene expression state: s_i is 0 when gene i is at a low expression level (silent) and 1 when it is at a high expression level (active). We modified this part following your advice.
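To make the notation concrete, the sketch below builds the effective matrix quoted by the reviewer, Fbar_ij = (2*s_i - 1)*F_ij, from a toy regulation matrix and a binarized state vector. The second function is only one possible reading of a frustration-like summary: it assumes that F[i][j] is the regulation received by gene i from gene j and that a negative effective entry marks a conflict with the state of gene i. The frustration score actually used in the manuscript may be defined differently, and all inputs here are hypothetical.

```python
def effective_matrix(F, s):
    """Return Fbar with Fbar[i][j] = (2*s[i] - 1) * F[i][j].

    F: square list-of-lists of inferred regulation strengths (toy input).
    s: binarized expression states, 0 (silent) or 1 (active), one per gene.
    """
    n = len(s)
    return [[(2 * s[i] - 1) * F[i][j] for j in range(n)] for i in range(n)]


def frustrated_fraction(F, s):
    """Fraction of nonzero effective entries that are negative.

    Assumption of this sketch: F[i][j] is the regulation received by gene i
    from gene j, so a negative Fbar[i][j] means the regulation conflicts
    with the current state of gene i. This is not the manuscript's formula.
    """
    fbar = effective_matrix(F, s)
    nonzero = [v for row in fbar for v in row if v != 0]
    if not nonzero:
        return 0.0
    return sum(v < 0 for v in nonzero) / len(nonzero)


# Toy 3-gene example: gene 0 and gene 2 active, gene 1 silent.
F = [[0, 1.0, -0.5],
     [0.8, 0, 0],
     [0, -0.3, 0]]
s = [1, 0, 1]

fbar = effective_matrix(F, s)
frac = frustrated_fraction(F, s)
```

The factor (2*s_i - 1) maps the states 0 and 1 to -1 and +1, so activation of an active gene and repression of a silent gene both give positive effective entries.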

    How were the h_f and h_m thresholds chosen?

    Reply: h_f and h_m are based on the distribution of each dataset. Following suggestions from another reviewer, we modified this part. All the highly variable genes were selected and the genes were binarized with Silverman's bandwidth method and K-means (Materials and Methods 2).

    What is the "density of each single cell" ("_t")? The formulation of the penalty of the distance between cells i and j (the expression with -logP_ij...) is unclear. What is the intuition behind it? What is r? How were the values of r (0.5 and 0.8) chosen?

    Reply: The probability density of cells in the expression space is based on kernel density estimation. Intuitively, a region of the expression space with more cells is more likely to be traversed by cell trajectories. The values are based on the distribution of the kernel density estimates in the different datasets.

    In the modified manuscript, we used trajectory simulation and removed this assumption for simplicity.

    One of the reasons the authors state to justify the choice of PLSR is "In the scRNA dataset, the number of genes is often comparable to or larger than the number of cells." This is not true most of the time. In nearly all recent studies, the number of cells is way larger than the number of genes measured.

    Reply: The PLSR method can certainly be used for data in which the number of cells is larger than the number of genes. Moreover, the PLSR method was applied to the cells that are the k nearest neighbors of each reaction coordinate point, which are a subset of the whole dataset (Materials and Methods 5). While we mainly present results obtained with PLSR, in this revised manuscript we also added results obtained with another method, GRISLI (Materials and Methods 9). The results are similar to those obtained with PLSR.

    There is a fleeting reference to a nice previous finding that supports their observations: "several lines of evidence support that EMT proceeds through a concerted mechanism. Indeed, both in vivo and in vitro studies have identified intermediate states of EMT that have co-expressed epithelial and mesenchymal genes (Pastushenko et al, 2018; Zhang et al, 2014)". The authors should thoroughly survey the literature related to EMT transition, development of pancreatic endocrine cells, and development of the granule cell lineage in dentate gyrus, to find more previously identified molecular/cellular features relevant to cell state/type transitions, compared and contrasted with findings from this study.

    Reply: Thanks. We added references on these cell phenotype transitions and modified the corresponding part. We do want to point out that the main focus of this work is that all these processes share a common feature of transient increase of intercommunity interactions.

    What is the "dynamo" package, which is supposed to contain a Python notebook? As of now, the code and data have not been made available. Both need to be released along with thorough documentation on how to run the code to reproduce the analyses described here.

    Reply: Thanks. Dynamo is a Python package accompanying our recent publication (8). We uploaded the code to GitHub and added the link to Dynamo.

    **Minor comments and questions**

    1. Replace "confliction" throughout the manuscript with "conflict" or "conflicting" as appropriate.

    Reply: Thanks. We modified them.

    Paragraph two of the Introduction (beginning with "Another example of transitions ...") is missing multiple references, esp. for the last four sentences.

    Reply: Thanks. We added references.

    There are direct quotes from previous papers like "predicts the future state of individual cells on a timescale of hours". The authors are highly encouraged to check for usage of exact phrasing using available text software such as iThenticate.

    Reply: Thanks a lot for pointing out this severe mistake. We re-edited the manuscript and checked it with iThenticate.

    • "Each community contains both E and M genes": what does this mean?

    Reply: The E (M) genes are defined as those genes that are active or have high expression levels in the epithelial (mesenchymal) state or sample. As we reorganized the manuscript, we added this explanation for all datasets in the caption of Fig. 1i.

    • Reference to Qiu 2021 is missing in the "Path analysis" subsection under Methods.

    Reply: We added it in the Methods.

    Fix: "transition between the cells that their sample time points are successive" in Methods.

    Reply: Thanks. We modified it.

    In Methods, under "Network inference", it is "partial least square regression" (not *least* s square).

    Reply: Thanks. We modified it.

    Figure 1: The cyan, magenta, and lime in 1C are very hard to see and, perhaps, the grey of the points can be made lighter. Also, change the red and green colors for the arrows in 1I to something else. These colors are not colorblind-friendly.

    Reply: Thanks. We re-plotted the figures and changed the colormap.

    • Periods and commas are missing at several places.

    Reply: Thanks. We fixed these and re-edited the manuscript.

    Reviewer #2 (Significance (Required)):

    The study uses RNA-velocity calculated from scRNA-seq data in an inventive way to characterize paths that reflect cell phenotype transitions. Then, a sparse gene regulatory network is reverse engineered from the data and the community structure within this network is examined at various stages along the transition to make observations about inter- and intra-community regulation and network "frustration". However, the study lacks the context of existing literature in terms of previous work studying cell transitions both experimentally and computationally. Adding this context (as suggested in the comments) will considerably improve the utility and significance of the findings. Overall, this study will be of broad interest to researchers interested in development and reprogramming as well as computational scientists developing and applying methods for scRNA-seq data analysis, trajectory inference, and network reconstruction. All the comments and questions raised here are based on my background and expertise in omics data (including scRNA-seq) analysis and network biology.

    Reviewer #3 (Evidence, reproducibility and clarity (Required)):

    The authors analyze three datasets of Single cell RNA velocity measured during phenotypic transition. They infer the gene regulatory network in each case and characterize the transition between the initial and final expression states (in which different sets of genes are expressed). Their motivating question was to find whether during such transitions first genes characterizing the initial state are no longer expressed and only then the genes associated with the final state start expressing or alternatively there is gradual transition through an intermediate state in which subsets of both initial and final state genes are transiently expressed.

    They define a measure of regulatory frustration representing the mismatch between regulatory signals a gene receives and its current expression state. They conclude that phenotypic transitions involve transient interactions between otherwise non-interacting gene modules and a temporary increase of gene frustration, which is relaxed once the final expression state is reached.

    The study uses advanced inference and machine learning methods.

    I find the question studied in this manuscript interesting, opening avenues to further questions and studies, and relevant to different scientific communities. Personally I think that the focus of the paper should be the exposition of the methods used, and this manuscript would benefit from a longer format, but that depends of course on the journal they are aiming at.

    Statistical analysis is missing. Especially since the authors mention the potential of over-fitting due to large number of genes (on the order of the number of cells) - I think the authors should provide a sensitivity analysis testing how sensitive are the conclusions to the choice of cells or genes by applying the methods to subsets of the cells / genes.

    Reply: Thanks. For the subset of cells, we randomly selected cells from the dataset and performed the analyses (Fig. EV11a). For the subset of genes, we randomly selected a subset of genes and performed the analyses (Fig. EV11b). We found that the results are not affected. We also performed another statistical analysis by varying the resolution of the community detection algorithm, and found that the conclusion on the variation of inter-community edges is not affected (Fig. EV12).

    Figure EV11 Statistical analyses of dentate gyrus neurogenesis. Each dot represents a cell and color represents the number of inter-community edges.

    (a) Frustration score along the RCs (left) and cell-specific variation of the number of inter-community edges (right) of a randomly selected sub-population of 2000 cells (from a total of 3184 cells);

    (b) Frustration score along the RCs (left) and cell-specific variation of the number of inter-community edges (right) of cells in the space of 400 randomly selected genes (from a total of 678 genes).
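The subsampling behind this sensitivity check can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `subsample_indices` and `restrict` and the seeds are hypothetical, and only the subset sizes (2000 of 3184 cells, 400 of 678 genes) come from the caption above.

```python
import random

def subsample_indices(n_total, n_keep, seed=0):
    """Return a sorted random subset of n_keep indices drawn from range(n_total)."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    return sorted(rng.sample(range(n_total), n_keep))

def restrict(matrix, rows, cols):
    """Keep only the chosen rows (cells) and columns (genes) of a cells x genes matrix."""
    return [[matrix[r][c] for c in cols] for r in rows]

# Subset sizes quoted in the Figure EV11 caption; the seeds are arbitrary.
cell_idx = subsample_indices(3184, 2000, seed=1)
gene_idx = subsample_indices(678, 400, seed=2)
```

The downstream analysis (GRN inference, frustration score, edge counts) would then be rerun on the restricted matrix and compared with the full-data result.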

    What is the meaning of the distribution in the frustration plots?

    Reply: For each cell we calculated a frustration score. Therefore, for the cells in each Voronoi cell (a geometric cell, not to be confused with a biological cell) along the reaction coordinate (Fig. 1d, Fig. 2b and 2g), we obtained a distribution of frustration scores.
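How per-cell scores become one distribution per Voronoi cell can be sketched as below. This is an illustration under stated assumptions: each biological cell is assigned to its nearest reaction-coordinate point by Euclidean distance in a reduced space, and the function names and toy coordinates are hypothetical.

```python
import math
from collections import defaultdict

def nearest_rc_point(cell, rc_points):
    """Index of the reaction-coordinate point closest to `cell` (Euclidean distance)."""
    return min(range(len(rc_points)), key=lambda r: math.dist(cell, rc_points[r]))

def scores_by_voronoi_cell(cells, scores, rc_points):
    """Group per-cell scores by the Voronoi cell (nearest RC point) of each biological cell."""
    groups = defaultdict(list)
    for cell, score in zip(cells, scores):
        groups[nearest_rc_point(cell, rc_points)].append(score)
    return dict(groups)

# Toy data: three RC points on a line and five cells in a 2-D reduced space.
rc_points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
cells = [(0.1, 0.2), (0.9, -0.1), (1.2, 0.3), (2.1, 0.0), (-0.3, 0.1)]
scores = [0.10, 0.40, 0.50, 0.20, 0.15]
dist = scores_by_voronoi_cell(cells, scores, rc_points)
```

Each list in `dist` is the sample that a violin or box at that reaction-coordinate index would summarize.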

    In general, the conclusions are well-justified, but I think some statements in the discussion are inaccurate: "intercommunity interactions of a GRN are indeed minimized" - are they minimal or are they only lower at the stable states? There are two stable states - for which of them is intercommunity interaction lower?

    Reply: Thanks. We agree with the reviewer and have modified the text. Compared with the transition state, the stable states have fewer intercommunity interactions. The quality of the datasets is not high enough for us to investigate whether "intercommunity interactions of a GRN are indeed minimized".
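The comparison of intercommunity interactions between states reduces to counting edges whose two genes sit in different communities. The sketch below assumes a directed edge list and a gene-to-community map. The gene names and the function `count_edges` are hypothetical, and the manuscript's weighting of "effective" edges is not reproduced here.

```python
def count_edges(edges, community):
    """Split directed edges (regulator, target) into intra- and inter-community counts."""
    intra = inter = 0
    for regulator, target in edges:
        if community[regulator] == community[target]:
            intra += 1
        else:
            inter += 1
    return intra, inter

# Toy network with two communities (hypothetical gene names).
community = {"g1": 0, "g2": 0, "g3": 1, "g4": 1}
edges = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g1")]
intra, inter = count_edges(edges, community)
```

Evaluating `inter` for the cell-specific network at each point along the reaction coordinate would give the curve whose transient rise the reply describes.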

    It is written in the discussion that 'for all three datasets frustration decreases with differentiation', but then Fig. 1g shows the opposite (final state is more frustrated than initial state). It is interesting to discuss the differences between the datasets analyzed in that respect and what could cause transition to a more frustrated state. I suggest that the authors also refer in the discussion to related questions and possible follow-up studies, such as: what determines the duration of the phenotypic transition? A relevant number is the switching time of a single gene.

    Reply: Good suggestion. Compared with the other datasets, the EMT results show larger variances. The relative difference of the frustration score is also affected by the GRN inference algorithm. For example, the difference between the initial and final frustration scores of pancreatic endocrinogenesis is more significant when using the GRISLI method (Figure EV6b). Despite these differences, the trend that the frustration scores transiently increase in the transition states remains consistent.

    Our conclusion is limited by the quality of the data, so we deleted this part of the discussion from the manuscript.

    Qiu et al. have shown that splicing-based RNA velocities are relative, while metabolic-labeling-based RNA velocities are more quantitative and accurate (8). We will re-analyze this problem if data with metabolic labeling become available.

    The authors mention at the end that the networks can often reach multiple final states from a common initial state. Do such transitions share some of their path (and in particular the intermediate frustrated state)? Given the intermediate connected state, it would be interesting to characterize the network stability to perturbations.

    Reply: This is a very important question. To address these questions reliably, we need higher-quality data. We plan to characterize the network stability to perturbations in future studies; in our recent paper using a full nonlinear modeling framework (8), we performed in silico perturbations.

    While interesting, the manuscript itself is unfortunately hard to read and would benefit from major editing, including better exposition of the science and language editing.

    Reply: Thanks. We revised the manuscript extensively.

    Methods: Descriptions of PCA and the 'revised finite temperature string method' are missing in the Methods section.

    Reply: Thanks. PCA is used in the RNA velocity analysis for dimension reduction. We added this in Materials and Methods 3. The revised string method is described in Materials and Methods 4.

    Some examples:

    Figure captions are very short and often non-informative. Some variables are not defined (or only defined later on) and the reader then needs to guess their meaning: it took me a while to understand what 'r' is in Fig. 1f and what 'r=10' (p. 4) means.

    Reply: Thanks. r represents the index number of the reaction coordinate. We added this to the manuscript where we define reaction coordinates.

    p. 4: what are 'f' (as opposed to F) and 's_ij' and 's_j' (expression states?) Or is fs_ij one variable? What does a Hamiltonian of a cell mean (p. 4, bottom)?

    Reply: F_ij is the regulation of gene j on gene i, and s_i is the expression state of gene i (0 for silence, and 1 for active expression). f_ij is the frustration value of the regulation from gene j to gene i.
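A small worked example may help fix these three quantities. The sketch below is an assumption-laden illustration, not the manuscript's formula: it uses the cell-specific effective matrix quoted in Referee #2's report, Fbar_ij = (2s_i - 1)F_ij, and assumes that the regulation from gene j to gene i is frustrated when regulator j is active and Fbar_ij is negative. The normalization in `frustration_score` is likewise hypothetical.

```python
def frustrated_edges(F, s):
    """Return (i, j) pairs where the regulation of gene j on gene i conflicts with s.

    F[i][j]: signed regulation of gene j on gene i (positive = activation).
    s[i]: expression state of gene i, 0 (silent) or 1 (active).
    Assumption: edge j -> i is frustrated when regulator j is active and the
    effective term (2 * s[i] - 1) * F[i][j] is negative.
    """
    out = []
    for i, row in enumerate(F):
        for j, f_ij in enumerate(row):
            if s[j] == 1 and (2 * s[i] - 1) * f_ij < 0:
                out.append((i, j))
    return out

def frustration_score(F, s):
    """Fraction of edges from active regulators that are frustrated (0.0 if there are none)."""
    active = sum(1 for row in F for j, f in enumerate(row) if f != 0 and s[j] == 1)
    return len(frustrated_edges(F, s)) / active if active else 0.0

# Toy network: gene 0 activates gene 1, and gene 1 represses gene 0.
F = [[0.0, -1.0],
     [1.0, 0.0]]
both_on = frustration_score(F, [1, 1])     # repression of an active gene conflicts
only_gene1 = frustration_score(F, [0, 1])  # gene 0 is silent, as its repressor demands
```

With both genes active, the repression of gene 0 by gene 1 is unsatisfied, so one of the two active regulations is frustrated. With gene 0 silent, no active regulation is violated.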

    The pseudo-Hamiltonian value was proposed in the literature by analogy with magnetic systems, following work on a Boolean model of EMT (9). A high Hamiltonian value indicates that the cell is in an unstable state. In the original manuscript we included this quantity because it has been discussed in the literature. However, we found that it causes confusion and is not necessary for our discussion, so we removed the pseudo-Hamiltonian results from the revised manuscript.

    P. 4: how are 'E and M genes' defined?

    Reply: The E (M) genes are defined as those genes that are active or highly expressed in the epithelial (mesenchymal) state or sample. We explained our general strategy in the caption of Fig. 1i.

    What does 'network heterogeneity' (p. 5) mean?

    Reply: Network heterogeneity measures how homogeneously the connections are distributed among the genes (10). High heterogeneity means that some genes have a high degree of connectivity (the so-called hubs), while others have a low degree of connectivity.
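To make the hub intuition concrete, the sketch below computes one common degree-based heterogeneity measure for a directed network: the product of the in-degree and out-degree standard deviations divided by the mean degree. This choice is an assumption, since the reply does not state the exact formula used in the manuscript. The function name and toy networks are hypothetical.

```python
from statistics import mean, pstdev

def degree_heterogeneity(edges, genes):
    """sigma_in * sigma_out / mean degree for a directed, unweighted network."""
    k_in = {g: 0 for g in genes}
    k_out = {g: 0 for g in genes}
    for regulator, target in edges:
        k_out[regulator] += 1
        k_in[target] += 1
    mean_degree = mean(k_in.values())  # equals the mean out-degree
    if mean_degree == 0:
        return 0.0
    return pstdev(k_in.values()) * pstdev(k_out.values()) / mean_degree

genes = ["g1", "g2", "g3", "g4"]
# Ring: every gene has one regulator and one target, so degrees are uniform.
ring = [("g1", "g2"), ("g2", "g3"), ("g3", "g4"), ("g4", "g1")]
# Hub: g1 regulates and is regulated by every other gene.
hub = [("g1", g) for g in genes[1:]] + [(g, "g1") for g in genes[1:]]
h_ring = degree_heterogeneity(ring, genes)
h_hub = degree_heterogeneity(hub, genes)
```

The ring has zero heterogeneity under this measure, while the hub network scores higher because one gene carries most of the connections.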

    Fig. 1 is too tiny and hard to read and details are missing.

    Reply: Thanks. We modified this figure and caption.

    A glossary for all the acronyms used would be very helpful.

    Reply: Thanks. We added a glossary to the manuscript.

    Language (some examples):

    p. 5 bottom: Another system is on development... invitro -> in vitro

    p. 6: 'measure on developmental potential' -> measure of...

    Reply: Thanks. We modified these and re-edited the whole manuscript.

    Reviewer #3 (Significance (Required)):

    This study presents a methodological advance in demonstrating the application of data analysis methods to study developmental phenotypic transitions. High throughput measurements and computation power available today enable putting to test theoretical conjectures, as made by Waddington. I think this is a promising line of research, which could be used to further develop the computational methods as well as to further our understanding of developmental transitions and potentially develop associated mathematical modeling frameworks.

    This study should be of interest to a diverse readership composed of developmental biologists as well as to quantitative biologists and CS researchers applying optimization techniques and data analysis methods to high-throughput biological data.

    I am not an expert on the computational methods applied in this manuscript and hence cannot assess their correct use and statistical analysis.

    1. Traag VA, Waltman L, & van Eck NJ (2019) From Louvain to Leiden: guaranteeing well-connected communities. Scientific Reports 9(1):5233.
    2. Stuart T, et al. (2019) Comprehensive Integration of Single-Cell Data. Cell 177(7):1888-1902.e1821.
    3. Bergen V, Lange M, Peidli S, Wolf FA, & Theis FJ (2020) Generalizing RNA velocity to transient cell states through dynamical modeling. Nature Biotechnology 38(12):1408-1414.
    4. Wolf FA, Angerer P, & Theis FJ (2018) SCANPY: large-scale single-cell gene expression data analysis. Genome Biology 19(1):15.
    5. Aubin-Frankowski P-C & Vert J-P (2020) Gene regulation inference from single-cell RNA-seq data with linear differential equations and velocity inference. Bioinformatics (Oxford, England) 36(18):4774-4780.
    6. Moris N, Pina C, & Arias AM (2016) Transition states and cell fate decisions in epigenetic landscapes. Nature reviews. Genetics 17(11):693-703.
    7. Kalkan T, et al. (2017) Tracking the embryonic stem cell transition from ground state pluripotency. Development 144(7):1221-1234.
    8. Qiu X, et al. (2022) Mapping Transcriptomic Vector Fields of Single Cells. Cell 185(4):690-711.
    9. Font-Clos F, Zapperi S, & La Porta CAM (2018) Topography of epithelial–mesenchymal plasticity. Proceedings of the National Academy of Sciences 115(23):5902-5907.
    10. Gao J, Barzel B, & Barabási A-L (2016) Universal resilience patterns in complex networks. Nature 530(7590):307-312.
  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



    Referee #3

    Evidence, reproducibility and clarity

    The authors analyze three datasets of single-cell RNA velocity measured during phenotypic transitions. They infer the gene regulatory network in each case and characterize the transition between the initial and final expression states (in which different sets of genes are expressed). Their motivating question was whether, during such transitions, the genes characterizing the initial state are first switched off and only then the genes associated with the final state start to be expressed, or whether there is instead a gradual transition through an intermediate state in which subsets of both initial- and final-state genes are transiently expressed.

    They define a measure of regulatory frustration representing the mismatch between regulatory signals a gene receives and its current expression state. They conclude that phenotypic transitions involve transient interactions between otherwise non-interacting gene modules and a temporary increase of gene frustration, which is relaxed once the final expression state is reached.

    The study uses advanced inference and machine learning methods.

    I find the question studied in this manuscript interesting, opening avenues to further questions and studies, and relevant to different scientific communities. Personally I think that the focus of the paper should be the exposition of the methods used; this manuscript would benefit from a longer format, but that depends of course on the journal they are aiming at.

    Statistical analysis is missing. Especially since the authors mention the potential for over-fitting due to the large number of genes (on the order of the number of cells), I think the authors should provide a sensitivity analysis testing how sensitive the conclusions are to the choice of cells or genes, by applying the methods to subsets of the cells / genes.

    What is the meaning of the distribution in the frustration plots?

    In general, the conclusions are well-justified, but I think some statements in the discussion are inaccurate: "intercommunity interactions of a GRN are indeed minimized" - are they minimal or are they only lower at the stable states? There are two stable states - for which of them is intercommunity interaction lower?

    It is written in the discussion that 'for all three datasets frustration decreases with differentiation', but then Fig. 1g shows the opposite (final state is more frustrated than initial state). It is interesting to discuss the differences between the datasets analyzed in that respect and what could cause transition to a more frustrated state. I suggest that the authors also refer in the discussion to related questions and possible follow-up studies, such as: what determines the duration of the phenotypic transition? A relevant number is the switching time of a single gene.

    The authors mention at the end that the networks can often reach multiple final states from a common initial state. Do such transitions share some of their path (and in particular the intermediate frustrated state)? Given the intermediate connected state, it would be interesting to characterize the network stability to perturbations. While interesting, the manuscript itself is unfortunately hard to read and would benefit from major editing, including better exposition of the science and language editing.

    Methods: Descriptions of PCA and the 'revised finite temperature string method' are missing in the Methods section.

    Some examples:

    Figure captions are very short and often non-informative. Some variables are not defined (or only defined later on) and the reader then needs to guess their meaning: it took me a while to understand what is 'r' in Fig. 1f and what 'r=10' (p. 4) means.

    p. 4: what are 'f' (as opposed to F) and 's_ij' and 's_j' (expression states?) Or is fs_ij one variable? What does a Hamiltonian of a cell mean (p. 4, bottom)?

    P. 4: how are 'E and M genes' defined?

    What does 'network heterogeneity' (p. 5) mean?

    Fig. 1 is too tiny and hard to read and details are missing.

    A glossary for all the acronyms used would be very helpful.

    Language (some examples):

    p. 5 bottom: Another system is on development... invitro -> in vitro

    p. 6: 'measure on developmental potential' -> measure of...

    Significance

    This study presents a methodological advance in demonstrating the application of data analysis methods to study developmental phenotypic transitions. High throughput measurements and computation power available today enable putting to test theoretical conjectures, as made by Waddington. I think this is a promising line of research, which could be used to further develop the computational methods as well as to further our understanding of developmental transitions and potentially develop associated mathematical modeling frameworks.

    This study should be of interest to a diverse readership composed of developmental biologists as well as to quantitative biologists and CS researchers applying optimization techniques and data analysis methods to high-throughput biological data.

    I am not an expert on the computational methods applied in this manuscript and hence cannot assess their correct use and statistical analysis.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



    Referee #2

    Evidence, reproducibility and clarity

    Understanding the cellular and molecular basis of cell type or cell state transitions occurring during development or reprogramming is a fundamental challenge. scRNA-seq has provided a window into gene expression programs across thousands of cells undergoing such transitions. Wang and colleagues leverage scRNA-seq and develop an approach to reverse engineer the gene regulatory network underlying cells along a path from one cell type/state to another, and characterize community-level properties of this network associated with various stages of the cell phenotype transition. The study is innovative and rigorous, and their results point to how intercommunity interactions increase and then decrease, indicating a concerted regulatory rewiring that orchestrates transitions. Application of their approach to three different datasets also shows that this trend is consistent across three different transitions and may be a general trend. However, there are some major and minor concerns that need to be addressed.

    Major comments and questions

    1. The analogy to SN1 and SN2 mechanisms of chemical bond formation is very nice.
    2. What is the basis for the two statements made in paragraph 3 of Introduction (beginning with "A question arises ...") about transitions being sequential or concurrent? Please:

    2.1. Provide references to previous experimental and computational studies that have investigated developmental and reprogramming gene expression programs.

    2.2. Describe specific examples of findings that support the two possible transitions highlighted here. Why couldn't transitions happen through an entirely gradual process involving changes to overlapping subsets of genes?

    3. Please make plots of the number of effective intra-community edges vs. number of active genes to support the statement that these two numbers are correlated.
    4. A bunch of notations are not clear:

    4.1. What is the "r" in "strongest intercommunity interactions at r = 10 (Fig. 1F)"? Is it the same as the "r" mentioned in the Methods section?

    4.2. What is "s_i" in "cell-specific effective matrix, Fbar_ij = (2s_i - 1)F_ij"? Also, that description of F_ij, f_ij, and H should be moved to the Methods section, and a more high-level, intuitive description should instead be included in this Results paragraph.

    5. How were the h_f and h_m thresholds chosen?
    6. What is the "density of each single cell" ("⍴_t")? The formulation of the penalty of the distance between cells i and j (the expression with -logP_ij...) is unclear. What is the intuition behind it? What is r? How were the values of r (0.5 and 0.8) chosen?
    7. One of the reasons the authors state to justify the choice of PLSR is "In the scRNA dataset, the number of genes is often comparable to or larger than the number of cells." This is not true most of the time. In nearly all recent studies, the number of cells is way larger than the number of genes measured.
    8. There is a fleeting reference to a nice previous finding that supports their observations: "several lines of evidence support that EMT proceeds through a concerted mechanism. Indeed, both in vivo and in vitro studies have identified intermediate states of EMT that have co-expressed epithelial and mesenchymal genes (Pastushenko et al, 2018; Zhang et al, 2014)". The authors should thoroughly survey the literature related to EMT transition, development of pancreatic endocrine cells, and development of the granule cell lineage in dentate gyrus, to find more previously identified molecular/cellular features relevant to cell state/type transitions, compared and contrasted with findings from this study.
    9. What is the "dynamo" package, which is supposed to contain a Python notebook? As of now, the code and data have not been made available. Both need to be released along with thorough documentation on how to run the code to reproduce the analyses described here.

    Minor comments and questions

    1. Replace "confliction" throughout the manuscript with "conflict" or "conflicting" as appropriate.
    2. Paragraph two of the Introduction (beginning with "Another example of transitions ...") is missing multiple references, esp. for the last four sentences.
    3. There are direct quotes from previous papers like "predicts the future state of individual cells on a timescale of hours". The authors are highly encouraged to check for usage of exact phrasing using available text software such as iThenticate.
    4. "Each community contains both E and M genes": what does this mean?
    5. Reference to Qiu 2021 is missing in the "Path analysis" subsection under Methods.
    6. Fix: "transition between the cells that their sample time points are successive" in Methods.
    7. In Methods, under "Network inference", it is "partial least square regression" (not least s square).
    8. Figure 1: The cyan, magenta, and lime in 1C are very hard to see and, perhaps, the grey of the points can be made lighter. Also, change the red and green colors for the arrows in 1I to something else. These colors are not colorblind-friendly.
    9. Periods and commas are missing at several places.

    Significance

    The study uses RNA-velocity calculated from scRNA-seq data in an inventive way to characterize paths that reflect cell phenotype transitions. Then, a sparse gene regulatory network is reverse engineered from the data and the community structure within this network is examined at various stages along the transition to make observations about inter- and intra-community regulation and network "frustration". However, the study lacks the context of existing literature in terms of previous work studying cell transitions both experimentally and computationally. Adding this context (as suggested in the comments) will considerably improve the utility and significance of the findings. Overall, this study will be of broad interest to researchers interested in development and reprogramming as well as computational scientists developing and applying methods for scRNA-seq data analysis, trajectory inference, and network reconstruction. All the comments and questions raised here are based on my background and expertise in omics data (including scRNA-seq) analysis and network biology.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



    Referee #1

    Evidence, reproducibility and clarity

    In this manuscript by Wang and colleagues, the authors analyse single-cell RNA-seq (scRNAseq) data by applying transition path theory to infer gene regulatory network (GRN) changes along the transition (reaction coordinate, trajectory) between free energy stable states (i.e. cell types). The work aims to understand how stable cell types, and their regulatory programs (combination of active and repressed genes) switches during differentiation/reprogramming/response (i.e. cell phenotypic transition/CPT). The premise of the work is to assess whether genes within GRNs undergo step-wise repression, state-change and activation (& vice-versa; analogous to SN1) or concurrently regulate gene expression (analogous to SN2). The GRNs are inferred based on highly variable genes and their expression dynamics from RNA velocity over CPT, across 3 scRNA-seq datasets.

    The authors first analyse a public scRNA-seq dataset of 3003 human A549 adenocarcinomic basal epithelial cells treated with TGF-β for 0 hrs, 8 hrs, 1 day and 3 days (4 timepoints). The authors select two stable states (Day 0, untreated, epithelial; and Day 3, treated, mesenchymal) using local kernel densities, set transition paths using the Dijkstra shortest path, divide the state space into Voronoi cells (i.e. reaction coordinate values), and construct single-cell GRNs based on RNA velocity differences (n=500 genes) and a linear model (from Qiu et al). This GRN is based on expression and velocity estimates, and does not distinguish direct from indirect regulation. Calculating interaction frequency (edges) across the two stable states over 4 Louvain clusters, the authors find a global increase in effective edges that correlates with increased active genes, but with a variable trend within inter-cluster edges. To quantify the concerted GRN changes between clusters, the authors utilise a "frustration" score (from Tripathi et al 2020). The average frustration score increases and peaks at day 1 of treatment, followed by a decline towards the terminal stable state (day 3 of treatment), similar to the interaction frequency trends. The authors also separately measure network heterogeneity and repeat the analysis using an alternative transition matrix. The authors conclude that EMT proceeds through concerted regulation of multiple genes, first with an increase in inter-cluster edges, frustration and heterogeneity, followed by a decrease into the final stable state.

    The authors apply the analysis to scRNA-seq data from (i) pancreatic endocrine differentiation, where Ngn3-low progenitors give rise to Ngn3-high, then Fev-high and finally glucagon-producing α-endocrine cells; and (ii) the dentate gyrus, with radial glial cell differentiation into nIPCs, neuroblasts, immature granule and mature granule cells. In both cases, the authors observe concerted regulation, with an initial increase in inter-community edges and heterogeneity during differentiation followed by a decrease towards the final stable state.
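The shortest-path step in this summary can be illustrated with a minimal sketch. It assumes the cost of moving from cell i to cell j is -log(P_ij), with P_ij a transition probability, as hinted by the "-logP_ij" expression that Referee #2 asks about. The density penalty mentioned there is omitted, and the function name and toy matrix are hypothetical.

```python
import heapq
import math

def shortest_path(P, start, goal):
    """Dijkstra on edge costs -log(P[i][j]); returns the node list from start to goal."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, i = heapq.heappop(heap)
        if i == goal:
            break
        if d > dist.get(i, math.inf):
            continue  # stale queue entry
        for j, p in P.get(i, {}).items():
            if p <= 0:
                continue  # no transition, so no edge
            nd = d - math.log(p)  # costs are non-negative because p <= 1
            if nd < dist.get(j, math.inf):
                dist[j] = nd
                prev[j] = i
                heapq.heappush(heap, (nd, j))
    if goal not in dist:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]

# Toy transition probabilities between three cells.
P = {"A": {"B": 0.9, "C": 0.1}, "B": {"C": 0.9}, "C": {}}
path = shortest_path(P, "A", "C")
```

Minimizing the summed -log(P_ij) is the same as maximizing the product of transition probabilities, so the path found here runs through B rather than taking the improbable direct jump.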

    The study and ideas in the manuscript are interesting and the methods would potentially be useful. However, there are a few specific and general comments stated below, which the authors should try to address.

    • P4: "RC increases first and reaches a peak when cells were treated with TGF-β for about one day, then decreases (Fig. 1G)". It would be better to label the figure with the treatment information.
    • Fig. 1G and EV1D: Why are the trends different?
    • How is the appropriate community/cluster/Louvain resolution selected? This can have a major impact on the number of cell states, types and the transition path from initial to final state.
    • What effect does the Louvain resolution have on e.g. frustration scores?
    • The authors match the resolution to samples/timepoints/known prior cell types, i.e. 3-4 communities. However, it is unclear whether this is enough to describe the entire differentiation/transition process.
    • Gene selection: The selection based on a minimum of 20 counts as highly expressed genes is arbitrary and dependent on sequencing depth. Perhaps the authors could show the distribution of gene counts for the datasets and use a data-driven filtering criterion.
    • The choice of 500 variable genes (for human A549 cells) is also quite arbitrary. Perhaps the authors could compare how additional genes (all highly variable genes) affect their analysis and interpretation.
    • How do other factors (sequencing depth, genes detected, number of cell types, multiple branches) affect the connectivity between communities at different phases of transition/development?
    • Are the velocity graph, transition matrix and subsequent shortest path estimation derived in a reduced latent space, and if so, how many dimensions (nPCs) are used and what impact does this have? Presumably, the density estimation is not performed in expression space.

    • The figure legends and labels were hard to read. These should be improved for better readability.
    • A suggestion would be to move the initial results section to methods and highlight the biological interpretation. The authors could highlight which GRN and representative gene/edge pairs are highest ranked within inter-community edges and in the overall final stable states.
    • How does the GRN inference compare to current state-of-the-art GRN inference methods for scRNA-seq?
    • How do extremely noisy/stochastic genes vary in metrics between final stable states? How are the metrics affected by the number of cells and the stochasticity of expression within a given cluster/community?
    • Given that the authors' approach includes both direct and indirect gene effects, the authors could further prune genes based on existing TF databases or validated protein-protein networks.
    • It is unclear which GRNs are already known and which ones are novel and biologically relevant.
    • It would be good for the authors to comment on cases with multiple bifurcations instead of A-B transitions, particularly in datasets with multiple discrete stable states.
    • Another suggestion would be to highlight gene expression of selected markers based on f-regression and mi over the trajectory.
    • If possible, a proof of principle could be re-analysis of a perturbation scRNA-seq dataset (e.g. where one path/transition path is stalled).

    Significance

    Nature and significance of advance: The study and ideas in the manuscript are interesting and the methods would potentially be useful to the community.

    Compare to existing published knowledge: -

    Audience: Predominantly computational audience

    Your Expertise: PI with background in experimental, computational biology and expertise in single-cell genomic tools and developmental biology