Discovery of cell type classification marker genes from single cell RNA sequencing data using NS-Forest

Abstract

The use of single-cell transcriptomic technologies that quantitatively describe cell transcriptional phenotypes using single cell/nucleus RNA sequencing (scRNA-seq) is revolutionizing our understanding of cell biology, leading to new insights in cell type identification, disease mechanisms, and drug development. The tremendous growth in scRNA-seq data has posed new challenges in efficiently characterizing data-driven cell types and identifying quantifiable marker genes for cell type classification. The use of machine learning and explainable artificial intelligence has emerged as an effective approach to study large-scale scRNA-seq data. NS-Forest is a random forest machine learning-based algorithm that aims to provide a scalable data-driven solution to identify minimum combinations of necessary and sufficient marker genes that capture cell type identity with maximum classification accuracy. Here, we describe the latest version, NS-Forest version 4.0, and its companion Python package ( https://github.com/JCVenterInstitute/NSForest ), with several enhancements to select marker gene combinations that exhibit selective expression patterns among closely related cell types and to perform marker gene selection more efficiently for large-scale scRNA-seq data atlases with millions of cells. By modularizing the final decision tree step, NS-Forest v4.0 can be used to compare the performance of user-defined marker genes with the NS-Forest computationally derived marker genes based on the decision tree classifiers. To quantify how well the identified markers exhibit the desired pattern of being exclusively expressed at high levels within their target cell types, we introduce the On-Target Fraction metric, which ranges from 0 to 1, with a value of 1 given to markers that are expressed only within their target cell types and not in cells of any other cell type. We have applied NS-Forest v4.0 to scRNA-seq datasets from three human organs: brain, kidney, and lung. We observe that NS-Forest v4.0 outperforms previous versions in its ability to identify markers with higher On-Target Fraction values for closely related cell types and outperforms other marker gene selection approaches in classification performance, with significantly higher F-beta scores.
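
The two metrics named in the abstract can be illustrated with a minimal sketch. The abstract does not give a formula for the On-Target Fraction, so the code below assumes one plausible reading: the median expression of a gene in its target cluster divided by the sum of its per-cluster medians, which yields 1 when the gene is expressed only in the target. The F-beta function uses the standard definition; the default beta of 0.5, which weights precision over recall, is likewise an assumption. The function names `on_target_fraction` and `f_beta` are hypothetical and are not taken from the NSForest package.

```python
import numpy as np


def on_target_fraction(expr, labels, target):
    """Illustrative On-Target Fraction for one gene (assumed definition).

    expr   : expression value of the gene in each cell
    labels : cluster label of each cell
    target : label of the cluster the gene is meant to mark
    """
    expr = np.asarray(expr, dtype=float)
    labels = np.asarray(labels)
    if not np.any(labels == target):
        raise ValueError(f"target cluster {target!r} not found in labels")

    # Median expression of the gene within every cluster.
    total = sum(np.median(expr[labels == c]) for c in np.unique(labels))
    if total == 0:
        # Gene has zero median expression everywhere: score it 0.
        return 0.0
    return float(np.median(expr[labels == target]) / total)


def f_beta(tp, fp, fn, beta=0.5):
    """Standard F-beta score from true/false positive and false negative counts."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


if __name__ == "__main__":
    labels = ["A"] * 3 + ["B"] * 3 + ["C"] * 3
    exclusive = [5, 6, 7, 0, 0, 0, 0, 1, 0]  # high only in cluster A
    shared = [4, 4, 4, 2, 2, 2, 2, 2, 2]     # also present in B and C
    print(on_target_fraction(exclusive, labels, "A"))
    print(on_target_fraction(shared, labels, "A"))
    print(f_beta(tp=8, fp=2, fn=4))
```

Under this assumed definition, a gene with nonzero median expression only in its target cluster scores 1, while a gene that is also expressed in other clusters scores lower in proportion to its off-target medians.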
