Cross-Dataset Identification of Human Disease-Specific Cell Subtypes Enabled by the Gene Print-based Algorithm--gPRINT


Despite extensive efforts in developing cell annotation algorithms for single cell RNA sequencing results, most algorithms fail to achieve cross-dataset mapping of cell subtypes due to factors such as batch effects between datasets. This limitation is particularly evident when rapidly annotating disease-specific cell subtypes across multiple datasets. In this study, we present gPRINT, a machine learning tool that utilizes the unique one-dimensional “gene print” expression patterns of individual cells. gPRINT is capable of automatically predicting cell types and annotating disease-specific cell subtypes. The development of gPRINT involved curation and harmonization of public datasets, algorithm validation within and across datasets, and the annotation of disease-specific fibroblast subtypes across various disease subgroups and datasets. Additionally, we created a preliminary single-cell atlas of human tendinopathy fibroblasts and successfully achieved automatic prediction of disease-specific cell subtypes in tendon disease. Furthermore, we conducted an exploration of key targets and related drugs specific to this subtype in tendon disease. The proposed approach offers an automated and unified method for identifying disease-specific cell subtypes across datasets, serving as a valuable reference for annotating fibroblast-specific subtypes in different disease states and facilitating the exploration of therapeutic targets in tendon disease.

Article activity feed

  1. Application of this algorithm to single-cell data from other biological species can increase our understanding of biodiversity

    I have some questions about this statement.

    1. Do you know if there is enough non-human single-cell data to do a similar study on a different organism? Mouse, for example, might have enough.
    2. Do you imagine this approach could be used for cross-species analysis, for example comparing a mouse to a human, or do you think it is limited to within-species analysis?
  2. The gPRINT algorithm accomplishes this by reordering the genes expressed in each cell according to the human reference genome sequence HG38 (refer to the "Methods" section) and plotting the gene expression levels to generate its unique "gene print" (Figure S1). Building on the principles of deep learning applied in voice recognition, the algorithm treats the positional information of gene open expression as temporal information in a sound wave. Each gene interval is treated as a frame segment in a sound wave, and a one-dimensional neural network is used to learn from a specific reference dataset and automatically predict cell identities in the query dataset.

    This is a very clever manipulation of the input data that allows it to be analyzed in a new way. As someone who is relatively new to this field, I would find it very helpful if, either in this section or in the introduction, you could provide references for any approach to sequencing data that does something similar. If there is no similar approach to date, that would also be helpful to highlight. I think the idea of using an embedding is fairly common (e.g. word2vec), but it would be helpful to know the boundaries of innovation for this particular approach.
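
For readers trying to picture the transformation described in the quoted passage, here is a minimal sketch of the idea as I understand it, not the authors' implementation. The gene names, coordinates, frame length, and filter weights are all hypothetical placeholders (they are not real HG38 positions or trained parameters), and the helper functions `gene_print`, `frame_signal`, and `slide_filter` are names invented for this illustration. The sketch orders one cell's expression values by genomic position, cuts the resulting one-dimensional signal into frames, and slides a single fixed filter along it, which stands in for one learned filter of a one-dimensional network.

```python
# Hypothetical annotation: gene -> (chromosome index, start position).
# Illustrative values only, NOT real HG38 coordinates.
GENE_POSITIONS = {
    "GENE_A": (2, 500),
    "GENE_B": (1, 300),
    "GENE_C": (1, 100),
    "GENE_D": (2, 50),
}


def gene_print(expression, positions):
    """Order one cell's expression values by genomic position."""
    ordered = sorted(expression, key=lambda gene: positions[gene])
    values = [float(expression[gene]) for gene in ordered]
    return ordered, values


def frame_signal(signal, frame_len):
    """Split a 1-D signal into non-overlapping frames, zero-padding the tail."""
    pad = (-len(signal)) % frame_len
    padded = list(signal) + [0.0] * pad
    return [padded[i:i + frame_len] for i in range(0, len(padded), frame_len)]


def slide_filter(signal, kernel):
    """Slide a filter along the signal (the sliding dot product used in 1-D conv layers)."""
    k = len(kernel)
    return [
        sum(s * w for s, w in zip(signal[i:i + k], kernel))
        for i in range(len(signal) - k + 1)
    ]


# One hypothetical cell: expression level per gene, in arbitrary input order.
cell = {"GENE_A": 5.0, "GENE_B": 0.0, "GENE_C": 2.0, "GENE_D": 1.0}

order, signal = gene_print(cell, GENE_POSITIONS)
frames = frame_signal(signal, frame_len=3)
response = slide_filter(signal, [1.0, 0.0, -1.0])

print(order)
print(signal)
print(frames)
print(response)
```

In a real pipeline the filter weights would be learned from the reference dataset and many filters would be stacked before a classification layer; the point here is only that genomic ordering gives neighbouring positions in the signal a fixed meaning across cells, much as time does in audio.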