Comparative analysis of common alignment tools for single-cell RNA sequencing


Abstract

With the rise of single-cell RNA sequencing, new bioinformatic tools have become available to handle specific demands, such as quantifying unique molecular identifiers and correcting cell barcodes. Here, we analysed several datasets with the most common alignment tools for scRNA-seq data. We evaluated differences in whitelisting, gene quantification, overall performance, and potential variations in clustering or the detection of differentially expressed genes.

We compared the tools Cell Ranger 5, STARsolo, Kallisto and Alevin on three published datasets for human and mouse, sequenced with different versions of the 10X sequencing protocol.

Striking differences were observed in the overall runtime of the mappers. In addition, Kallisto and Alevin showed variation in the number of valid cells and detected genes per cell. Kallisto reported the highest number of cells; however, we observed an overrepresentation of cells with low gene content and unknown cell type. Conversely, Alevin rarely reported such low-content cells.

Further variations were detected in the set of expressed genes. While STARsolo, Cell Ranger 5 and Alevin reported similar gene sets, Kallisto detected additional genes from the Vmn and Olfr gene families, which are likely mapping artifacts. We also observed differences in the mitochondrial content of the resulting cells when comparing a prefiltered annotation set to the full annotation set that includes pseudogenes and other biotypes.

Overall, this study provides a detailed comparison of common scRNA-seq mappers and shows their specific properties on 10X Genomics data.

Key messages

  • Mapping and gene quantification are the most resource- and time-intensive steps during the analysis of scRNA-seq data.

  • The use of alternative alignment tools reduces the time for analysing scRNA-seq data.

  • Different mapping strategies influence key properties of scRNA-seq data, e.g. total cell counts or genes per cell.

  • A better understanding of the advantages and disadvantages of each mapping algorithm might improve analysis results.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 3: Hirak Sarkar

    Producing a single-cell count matrix from the raw barcoded read sequences consists of several contributing steps such as whitelisting, correcting cell barcodes, resolving multi-mapped reads, etc. Each step can potentially introduce variability in the resulting count matrix, depending on the specific algorithm adopted by the tool used. Bruning et al. attempted to disentangle these effects using the most popular scRNA-seq quantification tools: Cell Ranger 5, STARsolo, Kallisto, and Alevin. The manuscript is well written and would add considerable value to the broad single-cell research community. I have a few concerns about the current draft of the manuscript that can be addressed in a revision.

    • The SCINA tool is used to construct an "artificial ground truth". The consensus of two or more mappers is used to arrive at this reference annotation. In my opinion, the consensus can lead to a biased reference, especially since STARsolo and Cell Ranger 5 follow a very similar pipeline; it is expected, by design, that those tools would have highly overlapping results.

    I suggest that simulated datasets from pre-decided clusters might be more appropriate for an unbiased evaluation (the recent paper from Kaminow et al., https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1.full, has similar simulations). Having said that, the current consensus-based analysis should, in my opinion, give a reasonable reference for most of the cells, but a more principled simulation is required to identify the extreme cases where each of the tools might show variable assignments.

    • The Sankey plots (Supp Figure 5) and the heatmaps (Supp Figure 6) represent the mutual agreement between different tools. As the SCINA clusters are used as ground truth, a more direct quantitative measure such as precision/recall would be more helpful.

    To be more specific, the resolution parameter of FindClusters could be tuned (now set to 0.12/0.15) to produce the same number of clusters as present in the ground truth. Each predicted cluster can then be assigned to a ground-truth cluster greedily, and the number of mismapped cells can be further categorized as false positives or false negatives (see the sketch after this list of concerns).

    • The variability of the different tools on the three real datasets is worth exploring in depth. For example, quoting from the paper, "Alevin detected more cells with less genes per cell in the PBMC and Endothelial dataset. However, it detected less cells with more genes per cell in the Cardiac dataset." It would be interesting to understand the origin of these variations and what the authors hypothesize; e.g., apart from mapping/alignment, there are additional steps in the quantification pipeline that could potentially lead to variation in the detected cells and the respective gene counts. The tools can also have underlying algorithmic biases that are worth exploring.

    • "We could show that Alevin often detects unique barcodes, which were not identified by the other tools. These barcodes had very low UMI content and were not listed in the 10X whitelist.", the alevin -- whitelist option (https://salmon.readthedocs.io/en/develop/alevin.html#whitelist) enables use of any external filtered whitelist while running alevin. I wonder if using this option would change the behavior mentioned in the manuscript.

    • The manuscript raises the important question of multi-mapped reads across cell types. It would be interesting to quantify the percentage of reads that are discarded as multi-mapped by the different tools (those which discard them). If that percentage is substantial, then handling such ambiguous reads through an EM-like algorithm might be promising (a way to quantify this is sketched after this list).
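    On the precision/recall suggestion above: below is a minimal sketch of how predicted clusters could be greedily matched to SCINA-derived reference labels and scored. It assumes two equal-length per-cell label vectors (pred_labels from the clustering, truth_labels from SCINA); the function name and the matching heuristic are illustrative and not taken from the manuscript.

```python
import pandas as pd

def greedy_match_and_score(pred_labels, truth_labels):
    """Greedily assign each predicted cluster to the ground-truth label it
    overlaps most, then report precision/recall per matched label."""
    ct = pd.crosstab(pd.Series(pred_labels, name="pred"),
                     pd.Series(truth_labels, name="truth"))
    assignment, used = {}, []
    # Visit predicted clusters in order of their largest single overlap.
    for pred in ct.max(axis=1).sort_values(ascending=False).index:
        candidates = ct.loc[pred].drop(labels=used, errors="ignore")
        if candidates.empty:
            continue                      # every ground-truth label is already taken
        best = candidates.idxmax()
        assignment[pred] = best
        used.append(best)
    scores = {}
    for pred, truth in assignment.items():
        tp = ct.loc[pred, truth]          # cells of this type captured by the cluster
        fp = ct.loc[pred].sum() - tp      # other cells pulled into the cluster
        fn = ct[truth].sum() - tp         # cells of this type assigned elsewhere
        scores[truth] = {"precision": tp / (tp + fp), "recall": tp / (tp + fn)}
    return assignment, pd.DataFrame(scores).T
```

    With Seurat output, the predicted cluster per barcode could be exported and joined to the SCINA calls on the cell barcode before calling such a function.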
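    On the external whitelist point: the snippet below only sketches what such a run might look like, assuming a 10X v3 library and the whitelist option described in the documentation linked above; all file paths and the thread count are placeholders.

```python
import subprocess

# Sketch of a salmon alevin run that supplies an external barcode list up front
# instead of letting alevin derive its own whitelist (paths are placeholders).
cmd = [
    "salmon", "alevin",
    "-l", "ISR",
    "-i", "salmon_index",                        # transcriptome index
    "-1", "sample_R1.fastq.gz",                  # cell barcode + UMI reads
    "-2", "sample_R2.fastq.gz",                  # cDNA reads
    "--chromiumV3",                              # 10X v3 chemistry
    "--tgMap", "txp2gene.tsv",                   # transcript-to-gene map
    "--whitelist", "10x_barcode_whitelist.txt",  # external (filtered) barcode list
    "-p", "8",
    "-o", "alevin_external_whitelist",
]
subprocess.run(cmd, check=True)
```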
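    On quantifying discarded multi-mappers: a minimal sketch using pysam is given below, assuming a BAM whose aligner reports the NH tag (e.g. STAR-style output); tools that never emit multi-mapped records would need their own logs instead, and the BAM file name is a placeholder.

```python
import pysam

def multimapped_fraction(bam_path):
    """Fraction of primary alignments whose NH tag reports more than one hit."""
    total = multi = 0
    with pysam.AlignmentFile(bam_path, "rb") as bam:
        for read in bam.fetch(until_eof=True):
            # Count each read once, via its primary record only.
            if read.is_unmapped or read.is_secondary or read.is_supplementary:
                continue
            total += 1
            if read.has_tag("NH") and read.get_tag("NH") > 1:
                multi += 1
    return multi / total if total else float("nan")

print(f"{multimapped_fraction('possorted_genome_bam.bam'):.1%} of reads are multi-mapped")
```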

    Plots and Figures

    Intersection Plots

    The minor differences on the y-axis of the intersection plots (Fig. 4, Supp Fig. 3, etc.) are not pronounced (a log scale might help).
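    As a toy illustration of the log-scale suggestion (the intersection sizes below are made up), a large shared set next to small tool-specific sets is far easier to read on a log axis:

```python
import matplotlib.pyplot as plt

# Made-up intersection sizes: one large shared set and several small
# tool-specific sets, which a linear axis tends to hide.
labels = ["all four tools", "Kallisto only", "STARsolo only", "Alevin only"]
sizes = [17500, 950, 60, 12]

fig, axes = plt.subplots(1, 2, figsize=(9, 3))
for ax, scale in zip(axes, ("linear", "log")):
    ax.bar(labels, sizes)
    ax.set_yscale(scale)                 # the change suggested above
    ax.set_title(f"{scale} y-axis")
    ax.tick_params(axis="x", labelrotation=30)
fig.tight_layout()
plt.show()
```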

    Overview Figure

    The manuscript correctly pointed out how different intermediate steps contribute to the general variance in the downstream results. An overview figure with a flow chart of a typical scRNA-seq quantification pipeline would be beneficial.

    Minor Concerns

    There is a spelling mistake in the abstract celtype -> cell-type

    Possible incomplete sentence: "The recommended annotation from 10X, which only contains genes with the biotypes protein coding and long non-coding, might lead to an overestimation of mitochondrial gene expression respectively the absence of other gene types."

  2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Serghei Mangul

    1 -- The abstract contains confusing terminology; for example, "became available" can be replaced by "developed".

    2 -- Also, "analysed several data sets" can be replaced by "benchmarked" to clearly indicate that this refers to benchmarking rather than analysis. Some terminology needs to be explained; for example, whitelisting should be defined.

    3 -- Kallisto is not an alignment tool in the proper sense, as it does not report the position of the read but only the transcript of origin; this is pseudoalignment. Alignment needs to be defined, or the word pseudoalignment used.

    4 -- How was the ground truth or gold standard defined? Is the assumption of the paper that the tool with the highest number of mapped reads performs best? This needs to be explained in the introduction.

    5 -- In general, read alignment is an artificial rather than a biological problem, so a molecular gold standard cannot be defined. See for example https://www.nature.com/articles/s41467-019-09406-4. It would be helpful to explain this upfront when talking about the gold standard and to cite this reference.

    6 -- It is unclear how the tools were selected. What was the reasoning for selecting only four tools, and how do the authors know that those tools are common? For a complete list of RNA-seq alignment tools, the authors can refer to https://arxiv.org/abs/2003.00110. A reasonable selection criterion would be to take the tools that are available, for example, in Bioconda, which makes installing those tools easy. However, randomly selecting tools is not acceptable. For example, why was Salmon not included while Kallisto was?

    7 -- The language of the paper needs to be improved; for example, in the Background section the word "great" is used, which can be replaced by more appropriate scientific wording.

    8 -- More explanation needs to be provided for Cell Ranger. Is it essentially a wrapper around STAR? Does it involve any novel algorithms or software development?

    9 -- The authors need to explain why they chose only 10X Genomics among the available single-cell platforms.

    10 -- The annotations may indeed influence the alignment when they are provided to the alignment tools. Is every alignment tool able to take custom annotations? The paper lacks a figure providing results on which annotation performs best for a given dataset.

    11 -- Datasets and reference genomes section: gold-standard datasets are not reported. It is not clear whether the paper includes such a dataset or whether it is missing; if it is missing, how are the authors able to say which read alignment tool performs best?

    12 -- The paper contains a single human sample. Is there any particular reason for that? The paper would benefit from having multiple human samples, as was done for the mouse. Did the authors perform a systematic search to identify as many single-cell samples as possible? If not, that would be desirable.

    13 -- Was the 10X human data only available on the 10X website, and not available on SRA or GEO?

    14 -- The paper provides a GitHub link with the datasets and the code used for this analysis. Does the GitHub repository also contain the BAM files? If not, those need to be uploaded. Additionally, are the code and summary data behind the figures provided?

    15 -- The beginning of the Results section would benefit from a short description of the datasets: for example, how many samples were there in total, what was the read length for each sample, and what was the number of reads for each sample? If these differ between samples, providing the mean and the variance would be helpful.

    16 -- In general, the figures need to be improved in terms of visualization; it is very hard to understand what they are trying to convey. For example, Figure 2 is very difficult to understand, and its purpose is also unclear. The same holds for Figure 3: it is a very busy figure, and it is hard to know what it is trying to convey.

    17 -- Figure 4 is also very hard to understand; using a log scale might improve it. Details such as what the x-axis represents are unclear. In general, the figures need to be improved.

    18 -- In general, figures need to be visually understandable and more effective.

  3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac001), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Bo Li

    Single-cell RNA-seq has revolutionized our ability to investigate cell heterogeneity in complex tissues. Generating a high-quality gene count matrix is a critical first step for single-cell RNA-seq data analysis. Thus, a detailed comparison and benchmarking of available gene-count matrix generation tools, such as the work described in this manuscript, is a pressing need and has the potential to benefit the general community.

    Although this work has great potential, the benchmarking efforts described in the manuscript are not comprehensive enough to justify its publication in GigaScience unless the authors address the following major and minor concerns.

    Major concerns:

    1. The authors should discuss related benchmarking efforts and the differences between previous work and this manuscript in the Background section instead of the Discussion section. For example, Du et al. 2020 (G3: Genes, Genomes, Genetics) and Booeshaghi & Pachter (bioRxiv 2021) should be mentioned and discussed in the Background section. In addition, the STARsolo manuscript (https://www.biorxiv.org/content/10.1101/2021.05.05.442755v1), which contains a comprehensive comparison of CellRanger, STARsolo, Alevin and Kallisto-BUStools, should be cited and discussed. Zakeri et al. 2021 bioRxiv (https://www.biorxiv.org/content/10.1101/2021.02.10.430656v1) should also be included and discussed in the Background section.

    2. Benchmark with the latest versions of the software. The choice of Cell Ranger, STARsolo, Alevin and Kallisto-BUStools is good because they are four major gene count matrix generation tools. However, I urge the authors to also include CellRanger v6 and Alevin-fry (Alevin_sketch/Alevin_partialdecoy/Alevin_full-decoy; see the STARsolo manuscript), which are currently lacking, in their benchmarking efforts. The authors may also consider adding STARsolo_sparseSA to the benchmark. Since single-cell RNA-seq tool development is a fast-evolving field, benchmarking the up-to-date versions of tools is critical for a benchmarking paper.

    3. Conclusions. The authors summarized the observed differences between tools based on the benchmarking results. This is good, but it could be even more helpful for end-users. I recommend that the authors emphasize their recommendations for end-users more clearly in the discussion/results section. For example, do the authors recommend one tool over the others under certain circumstances? If so, which tool, under which circumstances, and why? I like Figure 5 a lot and hope the authors can summarize this figure better in the manuscript.

    4. This manuscript concluded that differential expression (DEG) results showed no major differences among the alignment tools (Figure 4). However, the STARsolo manuscript suggested DEG results are strongly influenced by quantification tools (Sec. 2.6, Figure 5). Please explain this discrepancy.

    5. This manuscript suggested simulated data is not as helpful as real data. However, the STARsolo manuscript reported drastic differences between tools using simulated data. Please comment on this discrepancy.

    6. I have big concerns regarding the filtered vs. unfiltered annotation comparison. In particular for pseudogenes, we know that many of them are barely or lowly transcribed. As a result, many of these pseudogenes would not be captured by the single-cell RNA-seq protocol. At the same time, because these pseudogenes share sequence similarities with functional genes, they can cause trouble for read mapping. This is one of the main reasons for using a carefully filtered annotation. Actually, whether and how to filter the annotation is under active debate in big cell atlas consortia such as the Human Cell Atlas. Thus, I would be super careful about describing results comparing filtered vs. unfiltered annotation. For example, in Suppl. Figure 8D, there are 6 mitochondrial genes that have 100% sequence similarity to their corresponding pseudogenes. It is impossible to distinguish whether a read comes from the gene or the pseudogene for these 6 genes, and it is also not necessary: the transcribed RNA should be exactly the same. Thus, I encourage the authors to remove these pseudogenes from the annotation, and I suspect the mouse data results would then look similar to the human data in Suppl. Figure 8A (a sketch of such a filtering step is given after this list of major concerns).

    7. The endothelial dataset was only run on CellRanger 3 because the UMI sequence is one base shorter. Could the authors augment the UMI sequence with one constant base and run this dataset through CellRanger 4/5/6? (A sketch of such padding is given after this list of major concerns.)

    8. I think it is more appropriate to call the tools benchmarked as "gene count matrix generation tools" instead of "alignment tools".
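    For major concern 6, the sketch below shows one way pseudogene records could be stripped from an Ensembl-style GTF before building the mapping indices; it assumes the biotype is stored in a gene_biotype attribute (attribute names differ between annotation sources) and the file names are placeholders.

```python
import gzip
import re

def strip_pseudogenes(gtf_in, gtf_out):
    """Copy a GTF, dropping records whose gene_biotype contains 'pseudogene'."""
    biotype_re = re.compile(r'gene_biotype "([^"]+)"')
    with gzip.open(gtf_in, "rt") as fin, open(gtf_out, "w") as fout:
        for line in fin:
            if line.startswith("#"):
                fout.write(line)          # keep header/comment lines
                continue
            match = biotype_re.search(line)
            if match and "pseudogene" in match.group(1):
                continue                  # skip pseudogene annotation records
            fout.write(line)

strip_pseudogenes("annotation.gtf.gz", "annotation.no_pseudogenes.gtf")
```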
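    For major concern 7, the padding could in principle be done as sketched below, assuming the cell barcode and UMI are in R1 and that one fixed base plus a matching quality character is appended to every read; the file names are placeholders and this is an illustrative sketch, not a validated preprocessing step.

```python
import gzip

def pad_umi_reads(r1_in, r1_out, pad_base="A", pad_qual="F"):
    """Append one constant base (and quality value) to every barcode+UMI read,
    so the UMI reaches the length the newer chemistry definition expects."""
    with gzip.open(r1_in, "rt") as fin, gzip.open(r1_out, "wt") as fout:
        while True:
            header = fin.readline()
            if not header:                # end of file
                break
            seq = fin.readline().rstrip("\n")
            plus = fin.readline()
            qual = fin.readline().rstrip("\n")
            fout.write(header)
            fout.write(seq + pad_base + "\n")
            fout.write(plus)
            fout.write(qual + pad_qual + "\n")

pad_umi_reads("endothelial_R1.fastq.gz", "endothelial_R1.padded.fastq.gz")
```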

    Minor concerns:

    1. The Suppl Table 2 mentioned in the main text corresponds to Suppl. Table 3 in the attachment. In addition, there is no reference to Suppl Table 2.

    2. Suppl Table 3, PBMC: why do I see endothelial cell markers in the PBMC dataset?

    3. Suppl Figure 7 is never referenced in the main text.

    4. Suppl Figure 8D is never referenced in the main text.