TooManyCellsInteractive: a visualization tool for dynamic exploration of single-cell data

Abstract

As single-cell sequencing data sets grow in size, visualizations of large cellular populations become difficult to parse and require extensive processing to identify subpopulations of cells. Managing many of these charts is laborious for technical users and unintuitive for non-technical users. To address this issue, we developed TooManyCellsInteractive (TMCI), a browser-based JavaScript application for visualizing hierarchical cellular populations as an interactive radial tree. TMCI allows users to explore, filter, and manipulate hierarchical data structures through an intuitive interface while also enabling batch export of high-quality custom graphics. Here we describe the software architecture and illustrate how TMCI has identified unique survival pathways among drug-tolerant persister cells in a pan-cancer analysis. TMCI will help guide increasingly large data visualizations and facilitate multi-resolution data exploration in a user-friendly way.

Article activity feed

  1. A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giae056), where the paper and peer reviews are published openly under a CC-BY 4.0 license. These peer reviews were as follows:

    Reviewer 3: Georgios Fotakis

    1. General comments

    In this manuscript, the authors present TooManyCellsInteractive (TMCI), a browser-based TypeScript graphical user interface for the visualization and interactive exploration of single-cell data. TMCI facilitates the visualization of single-cell data by representing it as a radial tree of nested cell clusters. It relies on TooManyCells (TMC), a suite of tools designed for multi-resolution and multifaceted exploration of single-cell clades based on a matrix-free divisive hierarchical spectral clustering method. A key advantage of TMCI lies in its capability to provide a quantitative depiction of relationships among clusters, allowing for the delineation of context-dependent rare and abundant cell populations, as showcased in the original publication [1] and in the present manuscript. TMCI extends the capabilities of TMC significantly, notably enhancing computational performance, particularly in scenarios where multiple features are overlaid (an improvement that is attributed to the persistence of the PostgreSQL database).

    A notable aspect of this manuscript is the fact that the authors performed a benchmark using publicly available scRNAseq datasets. This benchmark highlights TMCI's superior performance over TMC and its comparable performance to two other commonly utilized tools (Cirrocumulus and CELLxGENE). Moreover, the authors showcase TMCI's applicability through aggregating publicly available scRNAseq data. Here, they successfully delineate sub-populations of cancer drug-tolerant persister cells by employing minimum distance search pruning, enhancing the visibility of small sub-populations. Additionally, the authors note an increase in ID2 gene expression among persister-cell populations, as well as the enrichment of unique biological programs between short- and long-term persister-cell populations. Furthermore, they observe an upregulation of the diapause gene signature across all treated sub-populations. The biological insights the authors glean are novel and highly intriguing. In general, this manuscript is well written, with the authors offering comprehensive documentation that covers the essential steps for installing and running TMCI through their GitHub repository. Additionally, they provide a minimal dataset as an example for users. However, there are a few minor adjustments that, once implemented, would enhance the manuscript's value by improving clarity and providing valuable insights to the field.

    2. Specific comments for revision

    a) Major

    • As stated in the manuscript's abstract, visualising large cell populations from single-cell atlases poses greater challenges and demands compute-intensive processes. One of my major concerns revolves around TMCI's scalability when handling large datasets. The authors conducted benchmarking on relatively modest datasets (ranging from 18,859 to 54,220 cells). Based on the data provided in Supplementary Table S3, while TMCI demonstrates comparable performance to CELLxGENE on the Tabula Muris dataset and its subset (with mean memory consumption differences ranging from 870 MB to 1.8 GB), the disparity significantly increases when loading and rendering visualizations of the larger dataset, reaching 8.5 GB of RAM. It would be of great interest if the authors conducted a similar benchmark using a larger dataset to elucidate how TMCI scales with increased cell numbers, especially considering the trend in the field towards single-cell atlases and the availability of datasets consisting of up to millions of cells (like the Tabula Sapiens [2] dataset or similar [3, 4]).

    • In the "Results" section, under the title "TMCI identifies sub-populations with highly expressed diapause programs," the authors assert that "the significantly different sub-populations were more easily seen in TMCI's tree". Since perception can be subjective (for instance, a user more accustomed to UMAP plots may find it challenging to interpret a tree representation), it would be beneficial for the authors to allocate a section of the supplementary material to demonstrate the clarity advantages of TMCI's tree visualization. One approach could involve a side-by-side comparison of visualizations generated by TMCI and CELLxGENE using the same color scheme. For instance, Figure 4b could be compared with Supplementary Figure S1g, Figure 4j with Supplementary Figure S1h, and so forth.

    • The "Discussion" section overlooks the future prospects of TMCI. As demonstrated in the case study, TMCI exhibits potential beyond serving as a visualization tool for identifying tree-based relationships in single-cell data. Are there any plans for integrating analytical functionalities to provide insights into cellular compositions and underlying biology, such as marker gene identification, differential gene expression analysis, and gene set enrichment analysis? In the future, could TMCI support the visualization of such results using methods like violin plots, heatmaps, and others?

    • In the "Materials and Methods" section, the authors outline the process of aggregating the scRNAseq datasets used for the case study, including filtering and normalization steps. However, scRNAseq technologies are prone to significant noise resulting from amplification and dropout events. Additionally, when integrating different scRNAseq datasets, users need to consider potential batch effects. Did the authors employ any de-noising or batch correction methods? If not, what was the rationale behind this decision? It would be intriguing to observe any potential differences in the results following the application of such methods.

    • Remaining within the "Materials and Methods" section, providing a brief description of the methods and tools utilized for the differential gene expression analysis, the GSEA (if not solely conducted through Metascape), and the packages utilized to generate the plots in Figures 3 and 4 would enhance clarity and facilitate reproducibility.

    • Figure 4 - b: Distinguishing between the various cell lines on the partitioned nodes based on the current color coding—particularly for the MDA-MB-231 and PC9 cell lines, as well as between the treated and untreated populations of the SK-MEL-28 cell line—is quite challenging. Employing a different color scheme would significantly enhance clarity, making the different cell populations more distinguishable.

    • Figure 4 - d and k: The authors should add statistics as relying solely on the box and whisker plots makes it challenging to ascertain whether there is a significant difference between the conditions. For instance, it appears that ID2 is over-expressed between the control and treated population only in the SK-MEL-28 cell line.
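The statistics requested for Figure 4 d and k could, for example, come from a permutation test on the difference in group means. The following is only a minimal sketch under stated assumptions, not the authors' analysis: the expression values are made-up placeholders, and `permutation_test` is a hypothetical helper that does not exist in TMCI or TMC.

```python
import random


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Hypothetical helper for illustration; not part of TMCI or TMC.
    """
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign group labels
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        diff = abs(sum(perm_a) / len(perm_a) - sum(perm_b) / len(perm_b))
        # Small tolerance so floating-point rounding does not hide ties.
        if diff >= observed - 1e-12:
            hits += 1
    # Add-one correction: the estimated p-value is never exactly zero.
    return (hits + 1) / (n_permutations + 1)


# Made-up expression values for one gene in two conditions (placeholders only).
control = [0.10, 0.20, 0.15, 0.30, 0.25, 0.20, 0.10, 0.22]
treated = [1.10, 0.90, 1.30, 1.00, 1.20, 0.95, 1.05, 1.15]

p_value = permutation_test(control, treated)
print(f"permutation p-value: {p_value:.4f}")
```

If several genes or cell lines are tested this way, the resulting p-values would additionally need a multiple-testing correction before being annotated on the plots.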

    b) Minor

    • In the "Results" section, under the title "TMCI reduces time to display trees," the authors state: "these benchmarks indicate not only the superior performance of TMCI to generate static and interactive tree of single-cell data compared to other tools…". However, based on the results presented in the manuscript and the supplementary material, it seems that TMCI may not be outperforming alternative interactive visualization methods. This phrase should be revised to accurately reflect the benchmark results.

    References

    1. Schwartz GW, Zhou Y, Petrovic J, Fasolino M, et al. TooManyCells identifies and visualizes relationships of single-cell clades. Nat Methods 2020;17(4):405-413. PMID: 32123397
    2. The Tabula Sapiens Consortium, The Tabula Sapiens: A multiple-organ, single-cell transcriptomic atlas of humans. Science 2022;376, eabl4896. DOI:10.1126/science.abl4896
    3. Sikkema L, Ramírez-Suástegui C, Strobl DC, et al. An integrated cell atlas of the lung in health and disease. Nat Med 2023;29, 1563-1577. DOI:10.1038/s41591-023-02327-2
    4. Salcher S, Sturm G, Horvath L, et al. High-resolution single-cell atlas reveals diversity and plasticity of tissue-resident neutrophils in non-small cell lung cancer. Cancer cell 2022;40(12):1503-1520.E8. DOI:10.1016/j.ccell.2022.10.008
  2. Reviewer 2: Mehmet Tekman

    PAPER: TOOMANYCELLSINTERACTIVE REVIEW


    Table of Contents


    1. Using the Application
       1. Positive Notes:
          1. General UI and Execution
       2. Negative Notes:
          1. Controls
          2. Documentation
          3. Feature Overlays:
    2. Docker / PostgreSQL
    3. Ethos of the Introduction

    The manuscript reads very well, and the quality of the language is good.

    This review tests the application itself and comments on some ambiguous wording in the Introduction.

    1 Using the Application

    I tested the Interactive Display at https://tmci.schwartzlab.ca/

    1.1 Positive Notes:

    
    1.1.1 General UI and Execution
    
    The general interactivity of the UI was very impressive and expressive. I liked that every aspect including the pies and the lines themselves could be coloured and scaled.
    
    I found the feature overlays and pruning history stack very intuitive, as well as rolling back the history on each state change.
    
    The choice of D3 was a good one, enabling very pleasing enter/exit/update state animations, as well as ease of SVG export.
    
    The inclusion of a command line `generate-svg.sh' for rendering without a browser is very useful.
    
    
    1.2 Negative Notes:
    

    1.2.1 Controls

    At first I wasn't able to find the controls, despite having the page open to 1330px wide, but then I realised I had to scroll down outside of the SVG container to find them.

    As mentioned in a recently opened PR, there's a CSS media rule `@media only screen and (min-width:1238px)' taking place, that looks strange on my Firefox 122 on Linux. Maybe better media rules for screens in the 700-900px wide range might be useful, as well as making separate rules for smartphones.

    1.2.2 Documentation

    TypeScript is a good language to develop in, and lends itself naturally to documentation, though I did notice a distinct lack of documentation above many functions in the code base.

    Perhaps write a bit more documentation to make the code base accessible to new collaborators?

    Otherwise, the quality of code looked good, and the license was GPLv3 which is always welcome.

    1.2.3 Feature Overlays:

    I found the feature overlays super useful, though limited by the number of colours. These appear to be limited to one colour for all genes.

    Very useful for showing multiple genes, but it would be nice to have the ability to colour the expression of different genes with different colours, at least for < 3 genes of interest (due to the difficult colour mixing constraints).

    2 Docker / PostgreSQL

    It is not clear to me what the Node server and PostgreSQL database running in the Docker container are actually doing, other than fetching cell metadata and marking user subsets from pruning actions.

    Could this not have been implemented in JavaScript (e.g. IndexedDB)? Why does the data need to be hosted if the user is loading it from their own machine anyway? Is the idea that the visualization should be shared by multiple users who will be accessing the same dataset?

    If this is a single-user analysis, then why not keep all the computation and retrieval on the client side?

    The reason I'm asking this is because I believe that by keeping the database operations within JavaScript, you could run the system within a single Conda environment, or even with a pure Node lockfile.

    I can understand needing Docker for development purposes, but requiring it to actually run the software seems excessive. Is it not possible to separate the client and server into Conda? That way, one could then include the visualization (as the end stage) in bioinformatic pipelines.

    3 Ethos of the Introduction

    This is a small wording complaint in the Introduction section.

    TooManyCellsInteractive (TMCI) presents itself as a solution to the conventional scRNA-seq workflows that prepare the data via the usual data → PCA → UMAP → kNN → clustering stages.

    TMCI hints that it is an alternative solution to this workflow, but from what I can see in the documentation, it appears to require a `cluster_tree.json' file, one that is produced only by the TooManyCells (TMC) pipeline.

    Unless I've misunderstood, it's not accurate to say that TMCI is an alternative to these conventional workflows, but that TMC is.

    TMCI simply consumes the files output by TMC and renders them. If what I'm saying is true, then the introduction should reflect that.

  3. Reviewer 1: Qingnan Liang

    Klamann et al. report a tool for single-cell data visualization, TMCI, which is related to the previous method, TMC. It is good to see such continuous work and maintenance of the method, and I agree that TMCI has the potential to promote the application of TMC. The manuscript is generally well written, and it fits well within the scope of GigaScience. TMCI is publicly available with reasonably detailed tutorials. At several points, however, the manuscript does not provide sufficient detail or rationale. I suggest the revisions and clarifications below before recommending publication.

    1. Does TMCI provide an interface with one or more popular single-cell frameworks, such as SingleCellExperiment, Seurat, or Scanpy? A TMCI user would probably use one of these frameworks to do other parts of the analysis.
    2. Is batch effect considered in the drug-treated data example? More generally, if a user wants to use TMCI with multiple datasets, what would be the recommended approach for handling batch effects? Also, we know cell cycle is a factor that is usually 'regressed out' in single-cell analysis. Does TMC/TMCI consider this?
    3. "To normalize cells between data sets, we used term frequency-inverse document frequency to weigh genes such that more frequent genes across cells had less impact on downstream clustering analyses." We know TF-IDF is becoming a common practice in scATAC-seq analysis. Is this TF-IDF approach common for tree construction (or hierarchical clustering) with high-dimensional data? Is this recommended for all users with scRNA-seq data?
    4. Figure 4C is not very easy to read. It may be helpful to label/highlight the comparison pairs to make the point.
    5. Also, it is not sufficiently emphasized how TMCI helped find this ID2 target, or how such visualization would trigger interesting downstream approaches. I guess the power of this tree approach is somewhat similar to the increasingly popular 'metacell' approach, which combines similar cells into 'cell states'. Thus it makes an interesting midpoint between 'single-cell' and 'pseudo-bulk'. It would really be helpful to see that some states (nodes), although similarly treated, behave differently than others, if there are such cases (not sure if cell lines have such heterogeneity). Similar comments apply to the pathway analysis part.
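To make the TF-IDF weighting quoted in point 3 concrete, here is a minimal sketch of one common textbook variant applied to a cells × genes count matrix. It is an assumption-laden illustration: the exact formula used by TMC may differ, the `tf_idf` function is hypothetical, and the counts are made up.

```python
import math


def tf_idf(counts):
    """Weight a cells x genes count matrix with a textbook TF-IDF variant.

    Hypothetical helper for illustration; TMC's own formula may differ.
    """
    n_cells = len(counts)
    n_genes = len(counts[0]) if counts else 0
    # Document frequency: number of cells in which each gene is detected.
    df = [sum(1 for row in counts if row[g] > 0) for g in range(n_genes)]
    # Inverse document frequency; genes never detected get weight 0.
    idf = [math.log(n_cells / d) if d > 0 else 0.0 for d in df]
    weighted = []
    for row in counts:
        total = sum(row)
        if total == 0:
            # A cell with no counts carries no information.
            weighted.append([0.0] * n_genes)
            continue
        # Term frequency (count share within the cell) times gene IDF.
        weighted.append([(c / total) * idf[g] for g, c in enumerate(row)])
    return weighted


# Made-up counts: 3 cells (rows) x 3 genes (columns).
counts = [
    [10, 0, 2],
    [8, 3, 0],
    [12, 0, 0],
]
weights = tf_idf(counts)
for row in weights:
    print([round(w, 3) for w in row])
```

In this toy matrix the first gene is detected in every cell, so its inverse document frequency is log(1) = 0 and it contributes nothing downstream, which is the "more frequent genes had less impact" behaviour described in the quoted sentence.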