A novel metric reveals previously unrecognized distortion in the analysis of scRNA-seq data

Abstract

High-dimensional data are becoming increasingly common in nearly all areas of science. Developing approaches to analyze these data and understand their meaning is a pressing issue. This is particularly true for single-cell RNA-seq (scRNA-seq), a technique that simultaneously measures the expression of tens of thousands of genes in thousands to millions of single cells. Popular analysis pipelines significantly reduce the dimensionality of the dataset before performing downstream analysis. One problem with this approach is that dimensionality reduction can introduce substantial distortion into the data, particularly by disrupting the local neighborhoods of certain points. Since many scRNA-seq analyses like cell type clustering or trajectory inference rely on these near-neighbor relationships, distortion in this aspect of the data could significantly influence the outcomes of these analyses. Here, we introduce a straightforward approach to quantifying this distortion by comparing the local neighborhoods of points before and after dimensionality reduction. We found that popular techniques like t-SNE and UMAP introduce substantial distortion even for simple simulated data sets. For scRNA-seq data, we found the distortion in local neighborhoods was often greater than 95%, and that there was no consistent set of neighborhoods across the various steps in the consensus scRNA-seq analysis pipeline. We also found that this distortion had profound impacts on the outcomes of cell type clustering and other downstream analyses. Our findings suggest that caution must be applied when interpreting results in terms of 2-D visualizations produced by tools like UMAP, and that there is a critical need for new dimensionality reduction tools that more effectively preserve the local topological structure of the data.
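The idea of comparing each point's local neighborhood before and after dimensionality reduction can be illustrated with a short sketch. The abstract does not specify the exact formula, so the code below makes several assumptions: neighborhoods are the k nearest neighbors under Euclidean distance, the disagreement between two neighborhoods is scored with the Jaccard distance, and the scores are averaged over all points. The function names `knn_indices` and `neighborhood_distortion`, the value of `k`, and the toy "reduction" (keeping the first two coordinates) are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def knn_indices(X, k):
    """Return the indices of the k nearest neighbors of each row of X.

    Brute-force Euclidean search; fine for small examples only.
    """
    sq = np.sum(X ** 2, axis=1)
    # Squared pairwise distances: |a|^2 + |b|^2 - 2 a.b
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)  # a point is not its own neighbor
    return np.argsort(d2, axis=1)[:, :k]


def neighborhood_distortion(X_high, X_low, k=20):
    """Average Jaccard distance between each point's k-NN set in the
    original space (X_high) and in the reduced space (X_low).

    0.0 means every neighborhood is preserved exactly;
    1.0 means no neighbor is shared for any point.
    (Assumed scoring rule; the source does not give the formula.)
    """
    nn_high = knn_indices(X_high, k)
    nn_low = knn_indices(X_low, k)
    scores = np.empty(len(X_high))
    for i, (a, b) in enumerate(zip(nn_high, nn_low)):
        a, b = set(a.tolist()), set(b.tolist())
        scores[i] = 1.0 - len(a & b) / len(a | b)
    return float(scores.mean())


# Toy usage: 200 random points in 50 dimensions, "reduced" by simply
# keeping the first two coordinates (a stand-in for PCA, t-SNE, or UMAP).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))
X_2d = X[:, :2]
print("distortion:", neighborhood_distortion(X, X_2d, k=20))
```

Under these assumptions, a score near 1 would correspond to the kind of neighborhood disruption the abstract reports. In practice, `X_2d` would be replaced by the output of the reduction method under study, and the brute-force search by an approximate nearest-neighbor index for large datasets.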
