Finding stable clusterings of single-cell RNA-seq data

Victor Klebanoff

Abstract

Given a UMI count matrix, cluster its cells. Suppose that a matrix of the same size, for additional cells from the same experiment, becomes available. Combine the matrices and find a clustering with the same number of clusters. Compare results for the original cells. If they do not agree well, conclude that the initial clustering is unstable.

Although this setup is unrealistic, it is practical to reverse the perspective: given a clustering, analyze samples of half of the cells. If results for enough samples are consistent with the results for all cells, conclude that the clustering is stable.

We propose using divisive hierarchical spectral clustering and describe a possibly novel mapping of the tree it produces to a set of nested clusterings.

Positive affinities are defined for points (representing cells in Euclidean space) that are k-nearest neighbors (k is an input parameter). The affinity equals the inverse of the distance between the points. Ng, Jordan, and Weiss’ algorithm divides a set of points into two clusters. The normalized cut measures the clusters’ separation. Recursion generates a hierarchy of clusters.
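
The affinity and the normalized cut described above can be sketched in a few lines of Python. This is a minimal illustration, not the author's implementation. The function names are hypothetical, the affinity is assumed to be symmetric (positive when either point is among the other's k nearest neighbors), and the normalized cut is assumed to follow the standard Shi and Malik definition, cut(A,B)/vol(A) + cut(A,B)/vol(B). The spectral bipartition step itself is not shown.

```python
import math


def knn_affinity(points, k):
    """Assumed symmetric affinity: 1/distance for k-nearest-neighbor pairs, else 0."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    aff = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Indices of the k points closest to point i (excluding i itself).
        others = (j for j in range(n) if j != i)
        neighbors = sorted(others, key=lambda j: dist[i][j])[:k]
        for j in neighbors:
            if dist[i][j] > 0:  # coincident points would give an infinite affinity
                aff[i][j] = aff[j][i] = 1.0 / dist[i][j]
    return aff


def normalized_cut(aff, labels):
    """Shi-Malik normalized cut of a two-cluster labelling (labels are 0 or 1)."""
    n = len(aff)
    # Total affinity crossing from cluster 0 to cluster 1.
    cut = sum(aff[i][j] for i in range(n) for j in range(n)
              if labels[i] == 0 and labels[j] == 1)
    # Volume of a cluster: sum of the degrees of its points.
    vol = [sum(sum(aff[i]) for i in range(n) if labels[i] == c) for c in (0, 1)]
    return cut / vol[0] + cut / vol[1]


# Toy data (hypothetical): two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
aff = knn_affinity(points, k=2)
good_split = [0, 0, 0, 1, 1, 1]
bad_split = [0, 1, 0, 1, 0, 1]
ncut_good = normalized_cut(aff, good_split)
ncut_bad = normalized_cut(aff, bad_split)
```

With these toy points no nearest-neighbor pair crosses the two groups, so the split that respects the groups has a normalized cut of zero, while the interleaved split has a strictly positive one.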

Viewing clusters as nodes of a tree, set the length of the branch between a node and each of its daughters to the normalized cut. The better the separation between the daughter clusters, the smaller the normalized cut, hence the closer they are to their parent. Nodes’ distances from the root define the mapping of the tree to nested clusterings. For four large data sets, clusterings were found that are compatible with published results.
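
One plausible reading of this mapping is sketched below in Python. The abstract does not spell out the exact rule, so the following is an assumption: splits are applied in order of their daughters' distance from the root, which yields clusterings with 1, 2, 3, ... clusters, each refining the previous one. The tree representation (a dict from node name to a `(ncut, left, right)` tuple), the function name, and the toy values are hypothetical.

```python
import heapq


def nested_clusterings(tree, root):
    """Return a list of clusterings, each a sorted list of node names.

    tree maps every internal node to (ncut, left_daughter, right_daughter);
    nodes absent from tree are leaves and are never split.
    """
    current = {root}
    clusterings = [sorted(current)]
    heap = []  # entries: (distance of the node's daughters from the root, node)

    def push(node, node_distance):
        if node in tree:
            ncut = tree[node][0]
            # Branch length from a node to each daughter is the node's normalized cut.
            heapq.heappush(heap, (node_distance + ncut, node))

    push(root, 0.0)
    while heap:
        daughter_distance, node = heapq.heappop(heap)
        _, left, right = tree[node]
        # Replace the parent cluster by its two daughters.
        current.remove(node)
        current.update((left, right))
        clusterings.append(sorted(current))
        push(left, daughter_distance)
        push(right, daughter_distance)
    return clusterings


# Toy hierarchy (hypothetical values): B splits more cleanly than A.
tree = {
    "root": (0.1, "A", "B"),
    "A": (0.5, "A1", "A2"),
    "B": (0.2, "B1", "B2"),
}
levels = nested_clusterings(tree, "root")
```

Because a daughter is never closer to the root than its parent, a cluster is always split after the cluster that contains it, which is what makes the resulting clusterings nested. In the toy hierarchy, B's daughters lie closer to the root than A's, so B is split before A.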

Analysis is performed for all cells and for multiple pairs of complementary samples of cells. For a given number of clusters, each sample’s clustering and clusters are compared to those of the full data set (restricted to the sample). This provides measures of the stability of the clustering and its clusters. Criteria to define stable clusterings and clusters are proposed. For two of the four large data sets, the clusterings compatible with published results were judged to be stable.
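
The comparison of a sample's clustering with the full clustering restricted to that sample can be illustrated as follows. This is a sketch under stated assumptions: the abstract does not name the comparison measures, so a best-match Jaccard index per cluster is used here purely as a stand-in, and all function names and toy labels are hypothetical.

```python
import random


def complementary_halves(cell_ids, seed=0):
    """Split the cells at random into two complementary samples of half the cells."""
    rng = random.Random(seed)
    shuffled = list(cell_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def cluster_sets(labels):
    """Group cells by cluster label: {label: set of cells}."""
    groups = {}
    for cell, label in labels.items():
        groups.setdefault(label, set()).add(cell)
    return groups


def best_match_jaccard(full_labels, sample_labels):
    """For each full-data cluster restricted to the sample, the best Jaccard
    index against any cluster found when clustering the sample alone."""
    restricted = {cell: full_labels[cell] for cell in sample_labels}
    full_groups = cluster_sets(restricted)
    sample_groups = cluster_sets(sample_labels)
    return {
        name: max(len(members & s) / len(members | s) for s in sample_groups.values())
        for name, members in full_groups.items()
    }


# Toy labels (hypothetical): eight cells in two clusters of four.
full_labels = {cell: ("a" if cell < 4 else "b") for cell in range(8)}
first_half, second_half = complementary_halves(full_labels)

# A sample clustering that agrees with the full clustering on its cells...
agreeing = {0: 0, 1: 0, 4: 1, 5: 1}
# ...and one that moves cell 1 into the other cluster.
disagreeing = {0: 0, 1: 1, 4: 1, 5: 1}
scores_agree = best_match_jaccard(full_labels, agreeing)
scores_disagree = best_match_jaccard(full_labels, disagreeing)
```

A score of 1.0 for every cluster means the sample reproduces the full clustering on its cells. Lower scores flag the clusters that change when half of the cells are removed.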

The method supports single-factor batch correction.

Version published to 10.1101/2025.09.17.672302 on bioRxiv
Sep 19, 2025
