Cross-Dataset Identification of Human Disease-Specific Cell Subtypes Enabled by the Gene Print-based Algorithm--gPRINT

Ruojin Yan
Chunmei Fan
Shen Gu
Tingzhang Wang
Zi Yin
Xiao CHEN

This article has been Reviewed by the following groups


Listed in

Evaluated articles (Arcadia Science)

Abstract

Despite extensive efforts in developing cell annotation algorithms for single cell RNA sequencing results, most algorithms fail to achieve cross-dataset mapping of cell subtypes due to factors such as batch effects between datasets. This limitation is particularly evident when rapidly annotating disease-specific cell subtypes across multiple datasets. In this study, we present gPRINT, a machine learning tool that utilizes the unique one-dimensional “gene print” expression patterns of individual cells. gPRINT is capable of automatically predicting cell types and annotating disease-specific cell subtypes. The development of gPRINT involved curation and harmonization of public datasets, algorithm validation within and across datasets, and the annotation of disease-specific fibroblast subtypes across various disease subgroups and datasets. Additionally, we created a preliminary single-cell atlas of human tendinopathy fibroblasts and successfully achieved automatic prediction of disease-specific cell subtypes in tendon disease. Furthermore, we conducted an exploration of key targets and related drugs specific to this subtype in tendon disease. The proposed approach offers an automated and unified method for identifying disease-specific cell subtypes across datasets, serving as a valuable reference for annotating fibroblast-specific subtypes in different disease states and facilitating the exploration of therapeutic targets in tendon disease.

Arcadia Science
Nov 6, 2023
Application of this algorithm to single-cell data from other biological species can increase our understanding of biodiversity

I have some questions about this statement.

Do you know if there is enough non-human single-cell data to do a similar study on a different organism? Perhaps mouse has enough, for example?

Do you imagine this approach could be used for cross-species analysis, for example comparing mouse to human, or do you think it is limited to within-species analysis?
Arcadia Science
Nov 6, 2023

The gPRINT algorithm accomplishes this by reordering the genes expressed in each cell according to the human reference genome sequence HG38 (refer to the "Methods" section) and plotting the gene expression levels to generate its unique "gene print" (Figure S1). Building on the principles of deep learning applied in voice recognition, the algorithm treats the positional information of gene open expression as temporal information in a sound wave. Each gene interval is treated as a frame segment in a sound wave, and a one-dimensional neural network is used to learn from a specific reference dataset and automatically predict cell identities in the query dataset.

This is a very clever manipulation of the input data that allows it to be analyzed in a new way. As someone who is relatively new to this field, I would find it very helpful if, either in this section or in the introduction, you could provide references for any approach to sequencing data that does something similar. If there is no similar approach to date, that would also be helpful to highlight. I think the idea of using an embedding is fairly common (e.g. word2vec), but it would be helpful to know the boundaries of innovation for this particular approach.
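
For readers who are also new to the idea, the reordering and framing steps described in the quoted passage can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the gene names, coordinates, and the helper names `genomic_order`, `gene_print`, and `frames` are hypothetical, the frame length and hop are arbitrary, and the one-dimensional neural network that would consume the frames is left out.

```python
import numpy as np

# Hypothetical annotation rows: (gene, chromosome, start position).
# Real coordinates would come from an hg38 annotation; these are made up.
ANNOTATION = [
    ("GENE_C", "chr2", 500),
    ("GENE_A", "chr1", 100),
    ("GENE_D", "chr10", 50),
    ("GENE_B", "chr1", 900),
]

def chrom_key(chrom):
    """Sort numbered chromosomes numerically, then X, Y, then anything else."""
    name = chrom[3:] if chrom.startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name))
    return (1, {"X": 0, "Y": 1}.get(name, 2))

def genomic_order(annotation):
    """Return gene names sorted by chromosome and start position."""
    rows = sorted(annotation, key=lambda r: (chrom_key(r[1]), r[2]))
    return [gene for gene, _, _ in rows]

def gene_print(expression, gene_names, ordered_genes):
    """Reorder one cell's expression vector into genomic order (a 1D signal)."""
    index = {g: i for i, g in enumerate(gene_names)}
    return np.array([expression[index[g]] for g in ordered_genes if g in index],
                    dtype=float)

def frames(signal, frame_len, hop):
    """Cut the 1D signal into overlapping windows, as is done with audio."""
    n = 1 + max(0, (len(signal) - frame_len) // hop)
    return np.stack([signal[i * hop:i * hop + frame_len] for i in range(n)])

# One cell, with genes in arbitrary (matrix column) order.
gene_names = ["GENE_C", "GENE_A", "GENE_D", "GENE_B"]
expression = [3.0, 1.0, 4.0, 2.0]

order = genomic_order(ANNOTATION)
signal = gene_print(expression, gene_names, order)
windows = frames(signal, frame_len=2, hop=1)
```

Each row of `windows` plays the role of one "frame segment"; a 1D convolutional classifier trained on a reference dataset would take such signals as input.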

Version published to 10.1101/2023.11.05.565588 on bioRxiv
Nov 6, 2023

