Guiding clustering and annotation in single-cell RNA sequencing using the average overlap metric
Abstract
Defining cell types using unsupervised clustering algorithms based on transcriptional similarity is a powerful application of single-cell RNA sequencing. A single clustering resolution may not yield clusters that represent both broad, well-defined populations and smaller subpopulations simultaneously. Therefore, when cell identities are not known prior to sequencing, robust comparison and annotation of inferred de novo clusters remains a challenge. In this work, we define the distance between single-cell clusters by proposing the use of the average overlap metric to compare ranked lists of differentially expressed genes in a top-weighted manner. We first benchmark our approach in a dataset with known ground truth, composed of highly similar yet distinct T-cell populations, and show that evaluating clusters with average overlap results in a consistent, precise, and biologically meaningful recapitulation of true cell identities. We then apply our approach to data from unsorted mouse thymocytes and characterize stages of T-cell development in the thymus, including minor populations of double-negative (CD4-CD8-) T-cells that are notoriously difficult to confidently detect in unsorted single-cell data. We demonstrate that measuring cluster similarity with average overlap of marker gene rankings enables robust, reproducible characterization of single cells and clarifies biological interpretation of their underlying identities in highly homogeneous populations.
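To make the top-weighted comparison concrete, the following is a minimal Python sketch of average overlap between two ranked gene lists, using the standard formulation: the mean, over depths d = 1..k, of the fraction of genes shared by the two top-d prefixes. The function names, the choice of evaluation depth, and the gene lists are illustrative assumptions; the abstract does not specify how the authors rank differentially expressed genes or which depth they use.

```python
def average_overlap(ranked_a, ranked_b, depth=None):
    """Average overlap of two ranked lists, evaluated down to `depth`.

    Assumes each list holds unique items ordered from most to least important.
    """
    max_depth = min(len(ranked_a), len(ranked_b))
    if depth is None:
        depth = max_depth
    if depth < 1 or depth > max_depth:
        raise ValueError("depth must be between 1 and the shorter list length")

    total = 0.0
    for d in range(1, depth + 1):
        # Agreement at depth d: shared genes among the top d of each list.
        shared = set(ranked_a[:d]) & set(ranked_b[:d])
        total += len(shared) / d
    return total / depth


def average_overlap_distance(ranked_a, ranked_b, depth=None):
    """Distance derived from average overlap (0 for identical rankings)."""
    return 1.0 - average_overlap(ranked_a, ranked_b, depth)


# Hypothetical marker rankings for three clusters (illustration only).
reference = ["Cd4", "Cd8a", "Il7r", "Rag1"]
swap_at_bottom = ["Cd4", "Cd8a", "Rag1", "Il7r"]
swap_at_top = ["Cd8a", "Cd4", "Il7r", "Rag1"]

print(average_overlap(reference, swap_at_bottom))
print(average_overlap(reference, swap_at_top))
```

In this sketch, a disagreement between the two highest-ranked genes lowers the score more than the same disagreement between the two lowest-ranked genes, which is the top-weighted behavior the abstract describes.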