Finding stable clusterings of single-cell RNA-seq data


Abstract

A sampling-based method that can identify stable (replicable) clusterings of cells for data presented as UMI counts is described. The structure of the processing pipeline is conventional: filter and transform counts, restrict to data for highly variable genes, reduce dimensionality, and cluster cells.
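
The four pipeline stages can be sketched as follows. This is a minimal illustration only: the abstract does not specify the filter, the transform, the gene-selection rule, or the dimensionality-reduction method, so the library-size scaling, log transform, variance ranking, and PCA used here are assumptions, as are the function name `preprocess` and its parameters. Cells are assumed to be rows and genes columns.

```python
import numpy as np

def preprocess(counts, min_total=1, n_hvg=2000, n_dims=20):
    """Hypothetical sketch: filter cells, transform, pick variable genes, reduce."""
    counts = np.asarray(counts, dtype=float)

    # Filter: drop cells with too few total UMI counts (assumed rule).
    keep = counts.sum(axis=1) >= min_total
    counts = counts[keep]

    # Transform: scale each cell to 10,000 counts, then log(1 + x) (assumed).
    scaled = counts / counts.sum(axis=1, keepdims=True) * 1e4
    logged = np.log1p(scaled)

    # Highly variable genes: rank genes by variance (assumed criterion).
    variances = logged.var(axis=0)
    hvg = np.argsort(variances)[::-1][:n_hvg]

    # Dimensionality reduction: PCA via SVD of the centered matrix (assumed).
    centered = logged[:, hvg] - logged[:, hvg].mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    scores = u[:, :n_dims] * s[:n_dims]
    return scores, keep, hvg
```

The returned `scores` array holds one low-dimensional point per retained cell and would be the input to the clustering step.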

Divisive (binary) hierarchical spectral clustering is used. We propose what may be a novel method to map a clustering tree to a set of nested clusterings.

For spectral clustering, non-zero affinities are defined for points that are k-nearest neighbors (k is an input parameter). The affinity equals the inverse of the distance between the points. This led to exploration of the variation of the distance between points (which represent cells in low-dimensional Euclidean space) that are k-nearest neighbors. Variation can be large, ranging over three orders of magnitude for one data set studied. This may have implications for other clustering schemes.
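
A minimal sketch of such an affinity matrix is given below. The function name is hypothetical, the points are assumed to be distinct (a zero distance would give an infinite affinity), and the rule for symmetrizing the matrix (a pair is kept if either point is among the other's k nearest neighbors) is an assumption, since the abstract does not state one.

```python
import numpy as np

def knn_inverse_distance_affinity(points, k):
    """Affinity 1/distance for k-nearest-neighbor pairs, 0 elsewhere."""
    points = np.asarray(points, dtype=float)
    n = points.shape[0]

    # All pairwise Euclidean distances; a point is not its own neighbor.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbors of each point.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    rows = np.repeat(np.arange(n), k)
    cols = neighbors.ravel()
    knn_dist = dist[rows, cols]

    affinity = np.zeros((n, n))
    affinity[rows, cols] = 1.0 / knn_dist

    # Assumed symmetrization: keep a pair if either direction is a k-NN link.
    affinity = np.maximum(affinity, affinity.T)
    return affinity, knn_dist
```

The second return value collects the k-nearest-neighbor distances, so the spread described above can be inspected with, for example, `knn_dist.max() / knn_dist.min()`.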

Given a set of points, Ng, Jordan, and Weiss’ algorithm is used to divide it into two clusters. Repeating for each daughter cluster – and its descendants – generates a clustering tree. Because the algorithm splits a set of points into two subsets, the points are mapped to two-dimensional Euclidean space for clustering. The clusters’ separation is measured by a quantity, H, calculated in two dimensions, which is formally identical to the F-statistic, equal to the between-cluster sum of squares divided by the within-cluster sum of squares, scaled by degrees of freedom. The larger H is, the greater the separation between the clusters.
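
One bipartition step and the separation measure can be sketched as below. The embedding follows the published Ng, Jordan, and Weiss recipe (normalized affinity, top eigenvectors, row normalization, k-means with two clusters); the deterministic k-means initialization, the function names, and the degrees-of-freedom convention for H (that of the one-way analysis-of-variance F-statistic) are assumptions, not details taken from the article.

```python
import numpy as np

def njw_bipartition(affinity, n_iter=100):
    """Split points into two clusters from a symmetric affinity matrix."""
    affinity = np.asarray(affinity, dtype=float)
    scale = 1.0 / np.sqrt(affinity.sum(axis=1))
    normalized = affinity * scale[:, None] * scale[None, :]

    # Two leading eigenvectors (eigh returns eigenvalues in ascending order).
    _, vecs = np.linalg.eigh(normalized)
    embedded = vecs[:, -2:]
    embedded = embedded / np.linalg.norm(embedded, axis=1, keepdims=True)

    # 2-means with an assumed deterministic start: two mutually distant points.
    offset = np.linalg.norm(embedded - embedded.mean(axis=0), axis=1)
    first = embedded[np.argmax(offset)]
    second = embedded[np.argmax(np.linalg.norm(embedded - first, axis=1))]
    centers = np.stack([first, second])
    labels = None
    for _ in range(n_iter):
        d = np.linalg.norm(embedded[:, None, :] - centers[None, :, :], axis=2)
        new_labels = d.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        for c in (0, 1):
            if np.any(labels == c):
                centers[c] = embedded[labels == c].mean(axis=0)
    return labels, embedded

def separation_h(points, labels):
    """Between-cluster over within-cluster sum of squares, scaled by d.o.f."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    groups = np.unique(labels)
    grand_mean = points.mean(axis=0)
    between = 0.0
    within = 0.0
    for g in groups:
        members = points[labels == g]
        center = members.mean(axis=0)
        between += len(members) * np.sum((center - grand_mean) ** 2)
        within += np.sum((members - center) ** 2)
    # Assumed convention: (g - 1) and (n - g) degrees of freedom.
    return (between / (len(groups) - 1)) / (within / (len(points) - len(groups)))
```

In this sketch H would be computed on the two-dimensional embedding, as in `labels, embedded = njw_bipartition(affinity)` followed by `separation_h(embedded, labels)`.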

Each cluster corresponds to a node of the clustering tree. Dividing a set of points into two subsets corresponds to defining two daughter nodes. The length of the branch between a node and each of its daughters is set to 1/H. That is, the larger the separation between the daughter clusters, the closer they are (viewed as nodes) to their parent node in the tree. Nodes' distances from the root define the mapping of the tree to a set of nested clusterings.
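
One possible reading of this mapping is sketched below: splits are applied in order of their daughters' distance from the root, so that the clustering with m clusters results from the m - 1 splits whose daughters lie nearest the root. This ordering rule is an interpretation of the abstract rather than a confirmed detail, and the `Node` class and function name are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(eq=False)  # eq=False: nodes are compared by identity
class Node:
    members: frozenset
    h: Optional[float] = None      # separation of the daughters; None for a leaf
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def nested_clusterings(root: Node) -> List[List[frozenset]]:
    """Return clusterings with 1, 2, 3, ... clusters, each refining the last."""
    # Distance of each split's daughters from the root: parent distance + 1/H.
    splits = []
    stack = [(root, 0.0)]
    while stack:
        node, dist = stack.pop()
        if node.h is None:
            continue
        daughter_dist = dist + 1.0 / node.h
        splits.append((daughter_dist, node))
        stack.append((node.left, daughter_dist))
        stack.append((node.right, daughter_dist))

    # A parent's daughters are always nearer the root than its grandchildren,
    # so sorting by distance never applies a split before its parent's split.
    splits.sort(key=lambda item: item[0])

    current = [root]
    result = [[root.members]]
    for _, node in splits:
        i = current.index(node)
        current[i:i + 1] = [node.left, node.right]
        result.append([c.members for c in current])
    return result
```

Because 1/H is positive, every branch has positive length, which is what guarantees that the resulting clusterings are nested.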

Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample’s clustering and clusters are compared to those of the full data set (restricted to the sample). If differences are small for all samples, the clustering may be considered stable.
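
The comparison could be organized along the following lines. The abstract does not name the measure used to compare clusterings, so the adjusted Rand index below is a stand-in chosen for illustration, and the function names and the even split into two halves are likewise assumptions.

```python
import numpy as np

def complementary_samples(n_cells, rng):
    """Randomly split cell indices into two disjoint halves covering all cells."""
    perm = rng.permutation(n_cells)
    half = n_cells // 2
    return np.sort(perm[:half]), np.sort(perm[half:])

def adjusted_rand_index(labels_a, labels_b):
    """Agreement between two labelings of the same cells (1.0 = identical)."""
    _, a = np.unique(np.ravel(labels_a), return_inverse=True)
    _, b = np.unique(np.ravel(labels_b), return_inverse=True)

    # Contingency table of cluster co-membership counts.
    table = np.zeros((a.max() + 1, b.max() + 1), dtype=np.int64)
    np.add.at(table, (a, b), 1)

    def pairs(x):
        return x * (x - 1) / 2.0

    sum_cells = pairs(table).sum()
    sum_rows = pairs(table.sum(axis=1)).sum()
    sum_cols = pairs(table.sum(axis=0)).sum()
    expected = sum_rows * sum_cols / pairs(a.size)
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0
    return (sum_cells - expected) / (max_index - expected)
```

For each sample with indices `idx`, the sample's own clustering would be compared with the full-data labels restricted to the sample, for instance `adjusted_rand_index(sample_labels, full_labels[idx])`, where `sample_labels` and `full_labels` are hypothetical label arrays.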

The method supports single-factor batch correction.

Preliminary analysis not discussed here suggests that differential expression can contribute to evaluating stability.
