Benchmarking Machine Learning Models for Cell Type Annotation in Single-Cell vs Single-Nucleus RNA-Seq Data

Giovane Tortelote

Abstract

Background Machine learning (ML) models can automate cell annotation and reduce human bias. However, it remains unclear which ML model best suits the characteristics of single-cell RNA sequencing data, and whether a model trained on whole-cell transcriptomes can be applied to transcriptomes collected from nuclei. This study evaluates the performance of eight ML models for cell annotation in single-cell (scRNA-seq) and single-nucleus (snRNA-seq) RNA sequencing datasets, focusing on their ability to generalize across datasets with varying cell populations and transcriptome isolation techniques.
Results In the first part, we use two publicly available scRNA-seq datasets of Peripheral Blood Mononuclear Cells (PBMC3K and PBMC10K) to assess the performance of each ML model in cell type classification within and across datasets. XGBoost achieved high accuracy (95.4%-95.8%), precision, and F1-scores, outperforming simpler models such as Logistic Regression and Naive Bayes. Ensemble methods such as XGBoost and Random Forest demonstrated strong precision and recall. Elastic Net generalized nearly as well, achieving an accuracy of 94.7%-95.1%. In the second part, we investigated the impact of transcriptome isolation technique (single-cell vs. single-nucleus RNA-seq) on ML model performance using the publicly available cardiomyocyte differentiation datasets (GSE129096). Although models such as XGBoost and Elastic Net excelled on single-cell data (accuracy and F1-scores > 95%), performance declined notably on single-nucleus data, suggesting that inherent transcriptomic differences can affect ML model classification capacity. Notably, all models struggled to classify intermediate-stage cells, highlighting the challenge of distinguishing transitional cell populations, such as cardiac progenitors that retain stem cell markers while also expressing markers of differentiated cells.
Conclusion ML models can be trained and applied to classify cells originating from both scRNA-seq and snRNA-seq. Ensemble tree-based models and Elastic Net penalized regression demonstrated superior performance and generalizability across diverse datasets, emphasizing the importance of model selection for robust cell annotation. These findings underscore the need for tailored computational approaches when working with heterogeneous transcriptome data.
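
The evaluation design described in the abstract (train on one dataset, then score on held-out cells from the same source and on cells from a different source) can be sketched with a toy example. This is a minimal illustration in plain Python, not the preprint's pipeline. The data are simulated, a nearest-centroid classifier stands in for the eight benchmarked models, and every name in it (`simulate_cells`, `fit_centroids`, `predict`, `macro_scores`, `marker_signal`, the cell type labels) is hypothetical. A weaker marker signal is assumed here as a crude stand-in for a protocol difference such as nuclei versus whole cells.

```python
import random
from collections import defaultdict

CELL_TYPES = ["T cell", "B cell", "Monocyte"]  # hypothetical labels
N_GENES = 6  # toy feature count; the first three genes act as markers


def simulate_cells(rng, n_per_type, marker_signal=4.0):
    """Simulate toy expression profiles with one elevated marker per cell type."""
    cells, labels = [], []
    for type_idx, name in enumerate(CELL_TYPES):
        for _ in range(n_per_type):
            profile = [rng.gauss(0.0, 1.0) for _ in range(N_GENES)]
            profile[type_idx] += marker_signal
            cells.append(profile)
            labels.append(name)
    return cells, labels


def fit_centroids(cells, labels):
    """Average the training profiles of each cell type."""
    sums = defaultdict(lambda: [0.0] * N_GENES)
    counts = defaultdict(int)
    for profile, label in zip(cells, labels):
        sums[label] = [s + x for s, x in zip(sums[label], profile)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in total] for lab, total in sums.items()}


def predict(centroids, cells):
    """Assign each cell to the cell type with the nearest centroid."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [min(centroids, key=lambda lab: sq_dist(centroids[lab], c)) for c in cells]


def macro_scores(y_true, y_pred):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precisions, recalls, f1s = [], [], []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in pairs)
        fp = sum(t != lab and p == lab for t, p in pairs)
        fn = sum(t == lab and p != lab for t, p in pairs)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    n = len(labels)
    return accuracy, sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


rng = random.Random(0)
train_x, train_y = simulate_cells(rng, 50)
same_x, same_y = simulate_cells(rng, 50)                       # same source, held out
other_x, other_y = simulate_cells(rng, 50, marker_signal=1.0)  # shifted source

centroids = fit_centroids(train_x, train_y)
within = macro_scores(same_y, predict(centroids, same_x))
cross = macro_scores(other_y, predict(centroids, other_x))

for name, scores in [("within-dataset", within), ("cross-dataset", cross)]:
    print(name, "accuracy=%.3f precision=%.3f recall=%.3f F1=%.3f" % scores)
```

Macro averaging gives each cell type equal weight, so a rare population that is classified poorly lowers the score even when overall accuracy stays high. This is one reason to report precision, recall, and F1 alongside accuracy.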

Version published to 10.21203/rs.3.rs-5754289/v1 on Research Square
Jan 8, 2025
