Triku: a feature selection method based on nearest neighbors for single-cell data

Abstract

Feature selection is a relevant step in the analysis of single-cell RNA sequencing datasets. Triku is a feature selection method that favours genes defining the main cell populations. It does so by selecting genes expressed by groups of cells that are close in the nearest neighbor graph. Triku efficiently recovers cell populations present in artificial and biological benchmarking datasets, based on mutual information and silhouette coefficient measurements. Additionally, gene sets selected by triku are more likely to be related to relevant Gene Ontology terms, and contain fewer ribosomal and mitochondrial genes. Triku is available at https://gitlab.com/alexmascension/triku .
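
The idea in the abstract, that informative genes are expressed by groups of cells lying close together in the nearest neighbor graph, can be illustrated with a toy sketch. The code below is not triku's implementation: the brute-force neighbor search, the shuffle-based null, the helper names (`knn_indices`, `neighborhood_expression`, `locality_score`) and the simulated cells are all assumptions made for illustration. It sums a gene's counts over each cell's neighborhood and uses a 1-D Wasserstein distance to measure how far that distribution lies from the one obtained when the same counts are shuffled across cells.

```python
import random

def knn_indices(points, k):
    """Brute-force k nearest neighbors (squared Euclidean), excluding the cell itself."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        neighbors.append([j for _, j in dists[:k]])
    return neighbors

def neighborhood_expression(counts, neighbors):
    """Sum a gene's counts over each cell and its k neighbors."""
    return [counts[i] + sum(counts[j] for j in nbrs) for i, nbrs in enumerate(neighbors)]

def wasserstein_1d(xs, ys):
    """1-D Wasserstein distance between two samples of equal size."""
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

def locality_score(counts, neighbors, rng, n_shuffles=20):
    """Mean distance between observed neighborhood sums and a shuffled null (assumed null model)."""
    observed = neighborhood_expression(counts, neighbors)
    total = 0.0
    for _ in range(n_shuffles):
        shuffled = counts[:]
        rng.shuffle(shuffled)
        total += wasserstein_1d(observed, neighborhood_expression(shuffled, neighbors))
    return total / n_shuffles

rng = random.Random(0)
# Hypothetical data: two well-separated populations of 20 cells in a 2-D embedding.
cells = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(20)]
cells += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(20)]
neighbors = knn_indices(cells, k=5)

marker = [3] * 20 + [0] * 20   # expressed only in the first population
scattered = marker[:]          # same counts, placed at random
rng.shuffle(scattered)

marker_score = locality_score(marker, neighbors, rng)
scattered_score = locality_score(scattered, neighbors, rng)
print(marker_score > scattered_score)
```

Both toy genes have identical counts, and therefore the same mean and variance, so a ranking based only on dispersion could not tell them apart. Only the neighborhood structure separates them in this sketch.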

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac017), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 3: James J Cai

    This manuscript introduces Triku, a k-NN-based feature selection method, as a key step for securing informative features when analyzing single-cell RNA sequencing datasets. The authors argue that most current feature selection methods are biased toward highly expressed genes rather than the actual marker genes that define cell populations. Instead, they focus on the local expression signature of each gene and compute how far it deviates from its null distribution. A gene list ranked by this deviation is derived after median correction. The authors use the Silhouette coefficient to support their conclusion of better modularity compared with other methods. Additionally, the randomness and robustness of the method are well discussed. In general, this article is well organized and well written. The examples from artificial and benchmark datasets, which show certain improvements over current methods, are illustrative. Triku will be a valuable contribution to the single-cell analysis field. The reviewer has some minor comments to help improve the manuscript further:

    1. The authors compare Triku to many other widely used methods but exclude Seurat. Although the Seurat method is implemented in Scanpy, as they state in the "FS methods" section, the default flavor in Scanpy is "Seurat" rather than "Seurat_v3", which is the default feature selection method in the latest version of Seurat. It would be good to make this clear. Also, sctransform, another popular alternative from Seurat, is not among the compared methods.

    2. Evidence for the statement "we observed that in certain datasets the Wasserstein distances tend to slightly increase with the mean expression of the genes" should be shown, to motivate the need for further correction. The reason why the median correction outperforms other correction methods is also left unexplained. For example, Seurat, which also uses a binning-based correction, uses the mean to control for the strong relationship between variability and average expression.

    3. Since the authors integrate a k-NN module into the pipeline, and k-NN is considered computationally expensive, it would be good to evaluate the time complexity and running speed compared with other methods.

    4. Triku assumes that local transcriptomic similarity is more likely to define cell types. Apart from clustering, whose quality might improve after Triku, it would be interesting to show any potential effects on other popular downstream analyses in the single-cell field, such as trajectory inference, given that Triku relies on locality.

    5. Triku builds the k-NN graph on UMAP throughout. To validate the robustness of Triku, the authors could also discuss alternative low-dimensional embedding methods, such as t-SNE, in the "robustness" section.

    6. Since Triku is likely to identify locally over-expressed genes, it would be interesting to see the overlap between the features selected by Triku and the differentially expressed genes, if a setting can be arranged that makes the two comparable.

    7. In the section on previous work, some claims are made without references. For instance, "Early methods for FS in scRNA-seq data were based on the idea that genes whose expression show a greater dispersion across the dataset are the ones that best capture the biological structure of the dataset". Another relevant reference that is missing is https://pubmed.ncbi.nlm.nih.gov/31861624/.

    8. Fig. S2 does not show exact gene names. For artificial data, why those four genes are representative is left unexplained.

    9. The authors classify reference 11, the dropout-based method, as "a new generation". As far as I know, the benchmark M3Drop was published in 2018.

    These comments were prepared with assistance from my graduate student, Yongjian Yang.
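
Reviewer 3's comment 2 concerns correcting distances that drift upward with mean expression. As a rough illustration of what a binned median correction does, the sketch below sorts genes by mean expression, groups them into fixed-size bins, and subtracts each bin's median distance. The fixed bin size, the function name `median_correct` and the numbers are assumptions; the manuscript's actual correction may differ in its details.

```python
from statistics import median

def median_correct(mean_expr, distances, bin_size=4):
    """Subtract from each gene's distance the median distance of genes with similar mean expression."""
    order = sorted(range(len(mean_expr)), key=lambda g: mean_expr[g])
    corrected = [0.0] * len(distances)
    for start in range(0, len(order), bin_size):
        bin_genes = order[start:start + bin_size]
        bin_median = median(distances[g] for g in bin_genes)
        for g in bin_genes:
            corrected[g] = distances[g] - bin_median
    return corrected

# Hypothetical values: distances drift upward with mean expression,
# and gene 3 stands out only relative to genes of similar expression.
mean_expr = [0.1, 0.2, 0.3, 0.4, 5.0, 6.0, 7.0, 8.0]
distances = [0.10, 0.12, 0.11, 0.90, 1.00, 1.05, 1.02, 1.01]

corrected = median_correct(mean_expr, distances)
top_raw = max(range(len(distances)), key=lambda g: distances[g])
top_corrected = max(range(len(corrected)), key=lambda g: corrected[g])
print(top_raw, top_corrected)
```

In this made-up example the raw ranking favors a highly expressed gene, whereas the corrected ranking favors the gene that is unusual for its expression level. Replacing `median` with a mean would give the Seurat-style alternative the reviewer mentions, at the cost of more sensitivity to outliers within a bin.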

  2. Reviewer 2: Rhonda Bacher

    The manuscript presents a new method, Triku, for feature selection in single-cell RNA-seq data. Feature selection is performed upstream of tasks such as clustering and differential expression to reduce the effect of genes with noisy expression. Triku uses a KNN approach to identify features that are unexpected within cells that are transcriptionally close. Overall, the manuscript is well-written, presented clearly, and is a promising new method for feature selection. The figures are also very nice.

    Major:

    1. In Figure 4, it is not obvious why different methods would rank so differently between the two datasets. What methods did those papers originally use for feature selection (if available)? Does that partially explain the differences?

    2. In Figure 6, does the left-most plot belong? It is not described in the legend.

    3. It would be helpful to note somewhere which category each of the other methods belongs to (i.e. variance-based or distribution-based).

    4. Some additional results and discussion on the number of genes selected would be helpful. 250-500 is quite low and may explain the poor overlap between the genes selected. In my experience with commonly used methods from the scran or Seurat packages, a more typical number of genes selected is around 2,000. What are the typical numbers used or recommended for the other methods compared here? Does the performance difference remain when expanded to the top 2,000 genes? And is Triku's performance better with 250 genes than with 2,000?

    5. In methods, "By default, the number of features is the one automatically selected by triku." These values should be put into the supplement to get a better idea of how many genes are being selected by default.

    Minor:

    1. In Figure 3, I would label the top and bottom as A and B; I initially misread the legend as referring to the top 250 and bottom 500 genes.

    2. What are the approximate run times a user can expect for this method?
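
Reviewer 2's major comment 4 asks how the overlap between methods depends on the number of genes selected. One simple way to quantify this is the Jaccard index between the top-n gene sets of two rankings, evaluated at several cutoffs. The sketch below uses invented scores for two hypothetical methods (`method_a`, `method_b`); it says nothing about the behavior of any real method.

```python
def top_n(scores, n):
    """Return the set of the n highest-scoring gene names."""
    return set(sorted(scores, key=scores.get, reverse=True)[:n])

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    return len(a & b) / len(a | b) if (a or b) else 1.0

# Invented per-gene scores from two hypothetical feature selection methods.
method_a = {"g1": 9, "g2": 8, "g3": 7, "g4": 3, "g5": 2, "g6": 1}
method_b = {"g1": 6, "g2": 9, "g3": 2, "g4": 8, "g5": 7, "g6": 1}

overlap_by_n = {
    n: jaccard(top_n(method_a, n), top_n(method_b, n))
    for n in (2, 4)
}
print(overlap_by_n)
```

Applied to real rankings, the same loop over cutoffs such as 250, 500 and 2,000 would show whether a low overlap is specific to short gene lists.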

  3. Reviewer 1: Christoph Ziegenhain

    In the present manuscript, Ascension and colleagues introduce a new feature selection method for scRNA-seq data to increase the relevance of selected genes for downstream analyses such as clustering. I am happy to see that the tool is deposited as an open-source package that seems easy to install and plugs seamlessly into the commonly used scanpy workflow and AnnData data structure. The documentation is sufficient to get users started. While the method has merit as a smarter approach to feature selection, the manuscript would benefit from some additional work in terms of both text and analysis.

    Major points

    1. While Triku's strategy is introduced as superior to preexisting methods, the strong improvements in synthetic data (at least for the NMI summary statistic) seem to turn rather incremental in the real-world datasets of Mereu et al. and Deng et al. The authors should discuss reasons for this difference. In light of the small differences and the fact that performance is only measured in abstract summary scores, it would be more convincing if the authors presented concrete cases where the application of Triku yields a difference in clustering or downstream analysis that is of biological relevance. The currently presented Gene Ontology and gene set enrichment analyses are too diffuse and do not give the reader a feeling for the impact Triku could have on their analysis.

    2. Comparison to other FS methods: Currently, the most widely used method is probably Seurat's FindVariableFeatures. It would be good to also run the presented example data through Seurat and include it in all comparisons (e.g. Figs. 3-6).

    3. Precision of text: There are quite a few statements throughout the text that seem slightly inaccurate. In their revision, the authors should work on precision and on guiding the reader through the background and the performed work with more clarity. Example: the observation that zeroes in UMI data are well described by the Poisson or NB distributions did not originate with Svensson et al. but had been described several years before. Compare Vieth et al., 2017, Bioinformatics and Chen et al., 2018, Genome Biology.

    4. One of the main assumptions of Triku is that important genes get "switched on", i.e. change their state from largely unexpressed to a relatively high expression level. I am wondering if the authors can comment on the performance of Triku in cases where the main difference between cells is a gradual change in already expressed genes, and whether such differences might get lost or masked by the selection performed by Triku.

    Minor points

    1. What is the rationale for selecting the % of zero expression as the descriptive statistic within the k-NN neighborhood? If a gene occurs in fewer cells but with higher expression, its dispersion would be higher too. This choice needs to be justified more precisely, and ideally the authors would add a version of Triku that works on dispersion (to show possible differences).

    2. Three main types of feature selection methods are introduced but not defined or explained further (p. 2).

    3. Since Triku performs more calculations/steps than existing methods for FS, the runtime is presumably higher. The authors should compare and comment on runtime.
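
Reviewer 1's minor point 1 contrasts two statistics that can be computed within a neighborhood: the proportion of zero counts and the dispersion. The toy sketch below computes both for two invented genes with the same mean expression. Dispersion is taken here as the variance-to-mean ratio, which is only one of several definitions in use, and the function names and count vectors are assumptions for illustration.

```python
def zero_fraction(counts):
    """Proportion of cells with zero counts for a gene."""
    return sum(1 for c in counts if c == 0) / len(counts)

def dispersion(counts):
    """Variance-to-mean ratio (assumed definition); requires a nonzero mean."""
    mean = sum(counts) / len(counts)
    variance = sum((c - mean) ** 2 for c in counts) / len(counts)
    return variance / mean

# Hypothetical neighborhood of 10 cells; both genes average 1 count per cell.
sparse_high = [5, 5, 0, 0, 0, 0, 0, 0, 0, 0]   # few cells, high expression
broad_low = [1] * 10                            # every cell, low expression

for name, gene in [("sparse_high", sparse_high), ("broad_low", broad_low)]:
    print(name, zero_fraction(gene), dispersion(gene))
```

In this example the gene expressed in fewer cells at a higher level has both the larger zero fraction and the larger dispersion, which is the relationship the reviewer points out. The two statistics need not rank genes identically in general, which is why a dispersion-based variant would be an informative comparison.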