Benchmarking Machine Learning Models for Cell Type Annotation in Single-Cell vs Single-Nucleus RNA-Seq Data
Abstract
Background Machine learning (ML) models can automate cell annotation and reduce human bias. However, it remains unclear which ML model best suits the characteristics of single-cell RNA sequencing data, and whether a trained model can be applied to transcriptomes collected from nuclei rather than whole cells. This study evaluates eight ML models for cell annotation in single-cell (scRNA-seq) and single-nucleus (snRNA-seq) RNA sequencing datasets, focusing on their ability to generalize across datasets with varying cell populations and transcriptome isolation techniques.
Results In the first part, we used two publicly available scRNA-seq datasets of peripheral blood mononuclear cells (PBMC3K and PBMC10K) to assess each model's cell type classification performance within and across datasets. XGBoost achieved high accuracy (95.4%-95.8%), precision, and F1-scores, outperforming simpler models such as Logistic Regression and Naive Bayes. Ensemble methods such as XGBoost and Random Forest demonstrated strong precision and recall. Elastic Net generalized nearly as well, achieving high accuracy (94.7%-95.1%). In the second part, we investigated the impact of the transcriptome isolation technique (single-cell vs. single-nucleus RNA-seq) on ML model performance using publicly available cardiomyocyte differentiation datasets (GSE129096). Although models such as XGBoost and Elastic Net excelled on single-cell data (accuracy and F1-scores > 95%), performance declined notably on single-nucleus data, suggesting that inherent transcriptomic differences can affect a model's classification capacity. All models struggled to classify intermediate-stage cells, highlighting the challenge of distinguishing transitional cell populations, such as cardiac progenitors that retain stem cell markers while expressing differentiated cell markers.
Conclusion ML models can be trained and applied to classify cells originating from both scRNA-seq and snRNA-seq data. Ensemble tree-based models and penalized Elastic Net regression demonstrated superior performance and generalizability across diverse datasets, emphasizing the importance of model selection for robust cell annotation. These findings underscore the need for tailored computational approaches when working with heterogeneous transcriptome data.