Challenges and Progress in RNA Velocity: Comparative Analysis Across Multiple Biological Contexts
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Review Commons)
Abstract
Single-cell RNA sequencing is revolutionizing our understanding of cell state dynamics, allowing researchers to observe the progression of individual cells’ transcriptomic profiles over time. Among the computational techniques used to predict future cellular states, RNA velocity has emerged as a predominant tool for modeling transcriptional dynamics. RNA velocity leverages the mRNA maturation process to generate velocity vectors that predict the likely future state of a cell, offering insights into cellular differentiation, aging, and disease progression. Although this technique has shown promise across biological fields, its accuracy varies depending on the RNA velocity method and dataset. We established a comparative pipeline and analyzed the performance of five RNA velocity methods on three datasets based on local consistency, method agreement, identification of driver genes, and robustness to sequencing depth. This benchmark provides a resource for scientists to understand the strengths and limitations of different RNA velocity methods.
Article activity feed
Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.
Learn more at Review Commons
Reply to the reviewers
Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.
I thank the Referees for their...
Referee #1
- The authors should provide more information when...
Responses
- Though this is not stated in the MS
- Figure 6: Why has only...
Response: We expanded the comparison
Minor comments:
- The text contains several...
Response: We added...
Referee #3
Evidence, reproducibility and clarity
Dr. Ancheta et al. designed several metrics to assess different velocity algorithms, including local consistency, method agreement, overlap of derived genes, and robustness to sequencing depth. Generally, this helps scientists understand the performance of each software package. However, I don't think this is enough to help scientists judge which one is better. The biggest problem in the manuscript is the lack of a ground truth. I suggest the authors choose tissues with a reliable ground truth, such as spermiogenesis, which has a single lineage direction: you would see a clear streamline from pachytene spermatocytes to sperm. Alternatively, embryonic cells cultured for different numbers of days, where the direction should go from earlier to later stages, would also work. I strongly suggest the authors use clear-lineage samples like spermiogenesis. After that, I think it will be a very helpful paper for scientists.
Local consistency is a useful parameter that helps scientists determine which method produces more uniform directions.
Method agreement is problematic because I don't know which method represents the ground truth. If you cannot obtain a ground truth, you could use the average direction or angle as a proxy ground truth to see which method deviates significantly from the average.
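The consensus-direction idea suggested above could be sketched as follows. This is a hypothetical illustration, not code from the manuscript: it assumes each method's velocity vectors are available in the same 2-D embedding, and reports each method's mean angular deviation from the per-cell average direction across methods.

```python
import numpy as np

def deviation_from_consensus(method_vectors):
    """Mean angular deviation (degrees) of each method from the per-cell
    average direction across all methods.

    method_vectors: dict mapping method name -> (n_cells, 2) array of
    embedded velocity vectors (hypothetical input format).
    """
    stacked = np.stack(list(method_vectors.values()))         # (m, n, 2)
    unit = stacked / np.linalg.norm(stacked, axis=2, keepdims=True)
    consensus = unit.mean(axis=0)                             # (n, 2)
    consensus /= np.linalg.norm(consensus, axis=1, keepdims=True)
    # Cosine of the angle between each method's vector and the consensus.
    cos = np.clip((unit * consensus).sum(axis=2), -1.0, 1.0)  # (m, n)
    angles = np.degrees(np.arccos(cos)).mean(axis=1)
    return dict(zip(method_vectors, angles))
```

A method whose mean deviation sits far above the others would be the "significantly biased" one described above.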
The overlap of the driver genes also lacks a ground truth. Fig. 4C is good; we could use the overlap of all methods' genes as a ground truth to see which method is too biased. I suggest you perform a GO term analysis to see the driver gene distribution and count how many genes are related to the expected GO terms. That would also provide evidence to support the ground truth.
Robustness to sequencing depth is a good parameter. No comment.
The discussion could include more about other methods, such as nascent RNA single-cell sequencing and full-length single-cell sequencing, which improve the estimation of alpha, beta, and gamma. This could help delve deeper into enhancing RNA velocity programs.
Significance
The biggest problem in the manuscript is the lack of a ground truth. I suggest the authors choose tissues with a reliable ground truth, such as spermiogenesis, which has a single lineage direction: you would see a clear streamline from pachytene spermatocytes to sperm. Alternatively, embryonic cells cultured for different numbers of days, where the direction should go from earlier to later stages, would also work. I strongly suggest the authors use clear-lineage samples like spermiogenesis. After that, I think it will be a very helpful paper for scientists.
Referee #2
Evidence, reproducibility and clarity
The authors present a comparative statistical analysis of five RNA velocity methods using two datasets and a single performance metric. Using the selected statistical metric, they describe the variable performance of RNA velocity methods, their variable robustness across different cell states, and the discrepancy of sets of identified lineage-specific driver genes.
At this point, the scientific community has extensively documented the limitations and lack of stability of RNA velocity performance across methods and datasets. In that context, the manuscript lacks clear theoretical and practical conclusions that would be beneficial to the scientific community.
The choice to focus on only a subset of RNA velocity methods is not discussed. Recent and important extensions, such as VeloVAE, VeloVI, LatentVelo, and Pyro-Velocity, are omitted, which limits the generality of the analysis. The statistical properties of the chosen consistency metric are not explored. The authors do not provide a justification for why this metric is appropriate for comparative analysis. Additionally, the authors do not present how the consistency score can be utilized to evaluate RNA velocity performance on user datasets. Overall, a discussion of the pressing issue of choosing statistical metrics to interpret RNA velocity results is lacking. The pros and cons of different RNA velocity methods, especially in light of the various statistical metrics, are not discussed. The manuscript does not present conclusions from the sampling analysis of sequencing depth. For instance, formalizing these findings with code that users can employ for their datasets would enhance the manuscript's practical utility. Overall, the manuscript would benefit from a thorough benchmark of the methodological approaches in the RNA velocity field, and from testing various methods to evaluate RNA velocity performance.
Significance
At this point, the scientific community has extensively documented the limitations and lack of stability of RNA velocity performance across methods and datasets. In that context, the manuscript lacks clear theoretical and practical conclusions that would be beneficial to the scientific community.
Referee #1
Evidence, reproducibility and clarity
The authors present a survey of RNA velocity methods, evaluate them on a variety of model datasets, and introduce metrics to determine the local consistency of each individual method. They investigate differences between the methods with a separate metric that identifies consistency across methods, and using this, comment on applicability of each method to novel datasets. The effect of these differences on a downstream driver-gene identification task is also evaluated, and further conclusions are drawn related to this, particularly related to variability as a function of sequencing depth.
Major comments:
There are a few changes that could be made to improve future applicability. One assumption seems to be that consistency between methods will indicate the most likely trajectory; however, without a number of ground-truth trajectories, it seems that this is difficult to justify. In fact, there are probably good reasons why this would not be the case and that some methods might well be expected to underperform in certain cases. A deep comparison of the mechanics of the methods, or validating a robust set of novel ground-truth trajectories, is probably beyond the scope of this paper, but it would be good to make some reference to the fact that these methods do differ in ways that might lead one to expect that some outperform others for good reasons.
Related to this, it's still tricky to identify a clear path between the underlying approaches of the methods, the empirical observations in Fig 6, and how a reader coming cold to the field could match these aspects to the aims of the analysis of their own dataset. However, I think that this might well be solved by rephrasing Fig 6 to include statements on the dataset itself - e.g., if you have large transcriptional diversity and a smaller dataset, then probably you would disfavour DeepVelo.
One other suggestion relates to the uncertainties in the trajectories. This is touched upon in the case of the 'DeepVelo' method, where, for example, it's mentioned that these have a large number of parameters and could be prone to overfitting. However, the paper doesn't go as far as to suggest that this could result in the trajectories having a much higher variance, which is potentially evident in the sequencing depth study. This is perhaps a confounding issue in the comparisons in, for example, Fig 1a, where it seems that there is much more dynamical structure in the DeepVelo plot, but in reality this may be due to a higher degree of variance in the predictions. In this case, it may be that, in fact, all of the trajectories are perfectly consistent between the methods, within their uncertainties, despite the 'mean' values displayed in the plot looking quite different.
Likely a full accounting of the uncertainties on all of the outputs of these models is also beyond the scope of the paper; however, some indication of what the variance of these trajectories looks like (or the average value of this across the UMAP plot, etc.), although optional, would also be another valuable point of comparison between the methods. Bootstrap resampling of the data and re-running the methods, for example, would likely give a good indication of the consequences of the behaviour seen in the sequencing depth studies.
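The bootstrap idea could look something like the sketch below. `run_method` is a hypothetical stand-in for whichever velocity pipeline is being tested (any callable mapping a counts matrix to per-cell velocity vectors), and the per-cell variance across replicates is the quantity being asked for:

```python
import numpy as np

def bootstrap_velocity_variance(counts, run_method, n_boot=50, seed=0):
    """Resample cells with replacement, re-run a velocity method, and
    return the per-cell variance of the estimated velocity vectors.

    run_method: callable mapping a (cells x features) matrix to a
    (cells x dim) array of velocity vectors (hypothetical interface).
    """
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[0]
    replicates = []
    for _ in range(n_boot):
        idx = rng.choice(n_cells, size=n_cells, replace=True)
        v = run_method(counts[idx])
        # Map estimates back to the original cells; cells not drawn in
        # this replicate stay NaN and are ignored by nanvar below.
        per_cell = np.full((n_cells, v.shape[1]), np.nan)
        for cell in np.unique(idx):
            per_cell[cell] = v[idx == cell].mean(axis=0)
        replicates.append(per_cell)
    return np.nanvar(np.stack(replicates), axis=0)
```

A large variance for a given method (relative to the others) would support the overfitting concern raised above.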
Minor comments on specific sections:
Intro:
It would be good to define exactly what you mean by 'state' with respect to the expression (or expression programs, etc) up front here.
- 'lineages between states' sounds a little awkward to me; I would suggest something like 'state lineages' instead.
- 'During cellular transitions' - suggest removing and starting with 'scRNASeq data...'
- Also would be good to define concretely what you mean by 'trajectory' with respect to expression
- 'inconsistent or incorrect directionalities' - suggest 'inconsistent or incorrect trajectories'?
Perhaps move 'RNA velocity has been applied...' after the description of RNA velocity
- 'Given these limitations...' - it would perhaps be nice to give some vignettes of how the methods differ before discussing their limitations
- 'As the mRNA matures...' - perhaps mention that the key piece of information is the splicing out of introns, and this permits identification of the mature and immature mRNA (otherwise it seems a bit vague), and/or define splicing in the text
- 'The method yields...' - redundant?
- '...linear differential equations with constant slope...' - unclear what 'constant slope' means here
- 'steady state solution' - not clear whether 'steady state' here is synonymous with 'equilibrium', so it would be good to define (e.g., whether this means alpha = gamma, or whether this is a statement on beta, etc.)
- '...the directionality in the cell-cell graph...' - not clear how this becomes a graph, so it would be good to expand on how this is obtained
- 'the results depend heavily on chosen hyperparameters' - not clear that the other methods don't have this issue. I would guess that this is a matter of degree, but it would be good to indicate whether there are specific reasons why some models are expected to be less sensitive by construction.
- 'treating them as probabilistic events and resulting in a Markov process' -> 'treating them as probabilistic events in a Markov process'
- 'a dynamic model (scv-Dyn) to address many of the original issues' - not clear from this description how the method fixes these issues. Perhaps also talk a bit more about this 'latent time' parameter, if it's useful to contextualise the results later.
- 'deriving gene-specific splicing parameters in a single step' - not clear what doing this in a 'single step' means, or why it might be advantageous.
- 'graph convolution' -> 'graph convolutional'
- '...but also its neighbors...' - neighbours in what space?
- 'the mouse pancreas, a well studied lineage' - phrasing sounds a bit weird, maybe 'with a well studied lineage'?
- 'RNA velocity streams' -> 'trajectories'? (if defined before)
- 'Therefore, a general benchmark that compares RNA velocity methods...' - It might be nice here also to explain what exactly one might expect to be the ground truth in this case, as it's not particularly clear that something like VeloCyto that apparently predicts basically nothing is a 'bad' result compared to something like scv-Sto, which predicts much more complicated dynamics where there may not be any.
- 'Disparate or contradictory results from various RNA velocity methods undermine our confidence in the predicted trajectories.' -
As mentioned before, I think it is important here to add that these methods were each individually developed for a specific purpose, and likely with an aim to improve upon previous techniques in the literature. So I think this kind of statement would have to be motivated a bit better, if claiming this without specific reference to the designs of each of the techniques on their own merit. For example, one could imagine in future an 'oracle' method that somehow predicts all trajectories with 100% accuracy, but its results are rejected because they don't align with the earlier and more primitive methods in the field. Equally, it could be that one model is designed specifically for a specific type of dataset, or for the low data size regime, so a comparison using other datasets is more unfair (I don't think that is necessarily the case here, but without surveying the rest of the literature, a reader would not know that).
Results:
- '30 nearest neighbors' - With a fixed number of neighbours rather than a fixed similarity, it seems like you might end up getting results that are hard to compare if this is calculated in a sparse region (or the datasets have significantly different sizes)?
- 'Cell types from well-defined lineages...', '...those with more complex cellular heterogeneity...' - it would be good to indicate how these are considered 'well defined' and what 'more complex cellular heterogeneity' refers to (whether this is just from the UMAP, or whether these are statements that include prior biological assumptions that are used to evaluate the method).
- '...their differentiation process is more complex...', '...lineages with complex diversity...' - is this a statement based purely on the variation in expression? If so, it would be good to be clear here.
- '...the landscape's smoothness varies depending on cell type.' - it's been a while since you mentioned 'landscape', so it would be good to remind the reader that this is the distribution of expression in your high-dimensional space.
- 'correlated with cell diversity' -> 'cell type diversity'? Although 'notochord, endoderm and hindbrain' have already been indicated as 'cell types' here, so this is a little confusing
- '...can indicate overtraining or over smoothing...' - this seems a little contradictory: I would assume that overtraining would result in higher variance, and oversmoothing would give the opposite effect?
- 'Altogether, the variation in agreement...' - it would be nice to go a little further here and make specific recommendations, even if it's something very vague, as there will be cases where there are no clear biological clues.
Downstream:
- '...overlap in macrostates...' - even if macrostates is a term defined in CellRank, it would be good to redefine it here for the reader
- '...scv-Dyn and UniTVelo both utilize a shared latent time variable...' - is it possible to give any indication why this might lead to a difference? Or even which might be more plausible?
- 'RNA Velocity' -> 'RNA velocity'
Robustness to sequencing depth:
- 'DeepVelo, scv-Dyn, and UniTVelo maintained low levels of correlation with the magnitudes from the full reads...' - this seems like quite a startling effect. It seems like this indicates the models are really quite unstable, if the removal of 2% of the dataset gives such a considerable difference.
Discussion:
- 'Our research emphasizes the importance of implementing a method that best fits the dataset...' - You don't indicate any goodness-of-fit metrics prior to this, so it's not really clear what this means in practice.
- 'Because the pancreas dataset is often used as a benchmark dataset for RNA velocity methods...' - presumably this could also mean that methods are developed to overfit to this dataset?
- 'Capturing the full splicing dynamics...' - not clear what 'full splicing dynamics' means here
Fig 1c. (and elsewhere):
Often it's hard to see the trajectory arrows in the rasterised plots. If this isn't just an issue with the review PDF, in Matplotlib it's possible to rasterise the points without rasterising the annotations and axes (https://matplotlib.org/stable/gallery/misc/rasterization_demo.html), which makes things a lot easier to read, but also doesn't leave you with giant file sizes.
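As a concrete illustration of this suggestion (with hypothetical stand-in data; the key detail is passing `rasterized=True` to the dense artist only):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend for script use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
xy = rng.normal(size=(5000, 2))             # stand-in for a UMAP embedding
uv = rng.normal(scale=0.1, size=(5000, 2))  # stand-in velocity arrows

fig, ax = plt.subplots()
# Only the dense point cloud is rasterised; the quiver arrows, axis
# labels, and ticks remain vector graphics and stay crisp in the PDF.
ax.scatter(xy[:, 0], xy[:, 1], s=2, rasterized=True)
ax.quiver(xy[::100, 0], xy[::100, 1], uv[::100, 0], uv[::100, 1], angles="xy")
ax.set(xlabel="UMAP 1", ylabel="UMAP 2")
fig.savefig("velocity_embedding.pdf", dpi=200)
```

The `dpi` argument controls the resolution of the rasterised layer inside the otherwise-vector PDF, keeping the file small.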
Fig 2a.: The equation is a bit confusing, as k seems to be both the size of the set of neighbours and the set itself, whereas in the text k is only ever the number of neighbours. Perhaps it would be better to have something like a set K of neighbours k (where k ∈ K), and then the sum is over k and normalised by |K| (or alternatively set k = |K| to be consistent with the text).
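In that notation, a minimal implementation of the metric might read as follows. This is a sketch assuming the score is the mean cosine similarity between a cell's velocity vector and those of its neighbour set K (it is not the authors' code):

```python
import numpy as np

def local_consistency(velocities, neighbor_sets):
    """Mean cosine similarity between each cell's velocity vector and
    those of its neighbours.

    velocities:    (n_cells, dim) array of velocity vectors
    neighbor_sets: neighbor_sets[i] is the set K of neighbour indices of
                   cell i; the sum runs over k in K and is normalised by
                   |K|, as in the suggested notation.
    """
    # Normalise once so dot products are cosine similarities.
    unit = velocities / np.linalg.norm(velocities, axis=1, keepdims=True)
    scores = np.empty(len(neighbor_sets))
    for i, K in enumerate(neighbor_sets):
        K = np.fromiter(K, dtype=int)
        scores[i] = (unit[K] @ unit[i]).sum() / K.size
    return scores
```

Writing it this way makes the |K| normalisation explicit and keeps the per-cell score in [-1, 1] regardless of how many neighbours each cell has.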
Fig 2d. (and others) A label on the z axis would be good here
Significance
This is a commendable effort, and will be of use to practitioners navigating the properties of current and future RNA velocity methods, particularly those without a background in the more advanced mathematical formulations of the newer methods. It is also the first time, to my knowledge, that a systematic comparison has been performed using a number of real datasets and with a novel metric.
However, the paper does not go as far as to link specific approaches or assumptions within the methods directly to their empirical observations, which could limit the applicability of the conclusions to only datasets and methods similar to those studied.