Optimal marker genes for c-separated cell types
Abstract
The identification of cell types in single-cell RNA-seq studies relies on the distinct expression signatures of marker genes. A small set of target genes is also needed to design probes for targeted spatial transcriptomics experiments and to target proteins in single-cell spatial proteomics or cell sorting. While traditional approaches test one gene at a time for differential expression between a given cell type and the rest, more recent methods have highlighted the benefits of jointly selecting markers that together distinguish all pairs of cell types simultaneously. These combinatorial methods differ mostly in their notion of discrimination between cell types: they use the Euclidean or Manhattan distance in the dimensions of the selected marker genes, or the difference in the fraction of cells expressing a given marker gene. The resulting combinatorial optimization problem then seeks a small set of genes that yields discrimination above a given threshold between all pairs of cell types. However, existing methods either consider all pairs of individual cells, which becomes intractable even for medium-sized datasets, or ignore intra-cell-type expression variation entirely by collapsing all cells of a given type to a single representative. Here we address these limitations and propose to find a small set of genes such that cell types are c-separated in the selected dimensions, a notion introduced previously for learning mixtures of Gaussians. To this end, we formulate a linear program that naturally takes into account expression variation within cell types without including each pair of individual cells in the model. This leads to a highly stable set of marker genes that accurately discriminates between cell types and can be computed to optimality efficiently.
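To make the separation notion concrete, here is a minimal Python sketch. It assumes the definition of c-separation commonly used in the mixture-of-Gaussians literature: two clusters with means μ₁, μ₂ and covariances Σ₁, Σ₂ in n dimensions are c-separated if ‖μ₁ − μ₂‖ ≥ c·√(n·max(λ_max(Σ₁), λ_max(Σ₂))). The exact formulation used in the article may differ. The function name `min_pairwise_separation` and its arguments are hypothetical. The sketch is not the linear program described above; it only evaluates how well a given gene set separates the cell types.

```python
from itertools import combinations

import numpy as np


def min_pairwise_separation(X, labels, genes):
    """Return the largest c such that all cell types are c-separated
    in the selected gene dimensions (assumed definition, see lead-in).

    X      : (cells x genes) expression matrix
    labels : cell type label per cell (at least 2 cells per type)
    genes  : indices of the selected marker genes
    """
    X = np.asarray(X, dtype=float)[:, genes]
    labels = np.asarray(labels)
    n = X.shape[1]  # number of selected dimensions

    # Per cell type: mean vector and largest eigenvalue of the covariance.
    stats = {}
    for t in np.unique(labels):
        cells = X[labels == t]
        mu = cells.mean(axis=0)
        cov = np.atleast_2d(np.cov(cells, rowvar=False))
        lam_max = np.linalg.eigvalsh(cov)[-1]  # eigenvalues are ascending
        stats[t] = (mu, lam_max)

    # Smallest separation over all pairs of cell types.
    # Assumes at least one type in each pair has nonzero variance.
    c_min = np.inf
    for a, b in combinations(stats, 2):
        (mu_a, lam_a), (mu_b, lam_b) = stats[a], stats[b]
        scale = np.sqrt(n * max(lam_a, lam_b))
        c_min = min(c_min, np.linalg.norm(mu_a - mu_b) / scale)
    return c_min
```

A gene set with a larger returned value keeps cell type means further apart relative to the within-type variation, which is the property that neither all-pairs-of-cells nor single-representative approaches capture compactly.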