Toward universal cell embeddings: integrating single-cell RNA-seq datasets across species with SATURN

Yanay Rosen
Maria Brbić
Yusuf Roohani
Kyle Swanson
Ziang Li
Jure Leskovec

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

Analysis of single-cell datasets generated from diverse organisms offers unprecedented opportunities to unravel fundamental evolutionary processes of conservation and diversification of cell types. However, interspecies genomic differences limit the joint analysis of cross-species datasets to homologous genes. Here we present SATURN, a deep learning method for learning universal cell embeddings that encodes genes’ biological properties using protein language models. By coupling protein embeddings from language models with RNA expression, SATURN integrates datasets profiled from different species regardless of their genomic similarity. SATURN can detect functionally related genes coexpressed across species, redefining differential expression for cross-species analysis. Applying SATURN to three species whole-organism atlases and frog and zebrafish embryogenesis datasets, we show that SATURN can effectively transfer annotations across species, even when they are evolutionarily remote. We also demonstrate that SATURN can be used to find potentially divergent gene functions between glaucoma-associated genes in humans and four other species.

Version published to 10.1038/s41592-024-02191-z
Feb 16, 2024
Arcadia Science
Jun 2, 2023
Abstract

This paper describes a new approach to analyzing single-cell RNA-seq data from multiple species. This has been a challenge because each species has a different repertoire of genes, and this paper seems to have solved the problem with a new approach to grouping genes across species based on embeddings from a protein language model. The idea is conceptually sound, and the results look quite good. The paper is quite clear and well-written. I have worked on this problem myself, and we have a forthcoming publication showing that protein structure space, while conceptually very similar to the embedding space used here, does not solve the task as well as SATURN. I have recommended this paper to my peers as an innovative approach, a useful method, and a clear demonstration of the power of protein language models to compare protein sequences.
Arcadia Science
Jun 2, 2023

This indicates that mouse and lemur B cells are correctly clustered with human memory B cells, which is additionally confirmed by strong expression of Cd19. Thus, SATURN can be used to obtain fine-grained level annotations when cell atlases have been annotated with different granularity levels. Additionally, we found that SATURN correctly identified cell types specific to a single species within the integrated datasets. For instance, in muscle tissue, SATURN separated human epithelial and mesothelial cells from all other cell types (Supplementary Figure 1). These cell types are indeed absent in mouse and lemur datasets. In spleen, SATURN separated human erythrocytes (Supplementary Figure 3).

I found these examples quite helpful. These are the kinds of behaviors one would expect from a properly functioning cross-species integration method.

Arcadia Science
Jun 2, 2023

SATURN performs differential expression on macrogenes.

The authors use model weights to generate count matrices in macrogene space. This has the advantage of mapping genes that don’t have one-to-one orthologs. How important is this approach to the success of SATURN? Would SATURN work just as well if each gene were simply grouped with its most similar macrogene and read counts were not distributed across macrogenes? There may be benefit in further study of macrogenes made from weighted counts as a broader concept, but it is not clear that the autoencoder used to learn SATURN weights is critical to the method. I suspect the latent space learned by the protein language model is responsible for most of the apparent success of SATURN, and that simpler dimensionality reduction and embedding methods would be sufficient to generate species-aligned datasets. In any case, most researchers want to work in gene space, and SATURN crudely defaults to gene space for biological interpretation. In short, what are the benefits and drawbacks of a weighted macrogene space compared with one based on simple assignment of genes to parent macrogenes? Would the authors like to comment on the utility of macrogenes in analyzing protein evolution more generally? Or perhaps summarize key results from the emerging field of protein language embedding?
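
The contrast between weighted and hard gene-to-macrogene assignment raised in this comment can be made concrete with a small sketch. It assumes a non-negative gene-by-macrogene weight matrix `W` and a cells-by-genes count matrix `X`, both filled with toy values rather than anything learned by SATURN. The function names are hypothetical, and a plain matrix product stands in for whatever aggregation SATURN actually performs.

```python
import numpy as np

def weighted_macrogene_counts(X, W):
    """Spread each gene's counts over all macrogenes in proportion to W."""
    return X @ W

def hard_macrogene_counts(X, W):
    """Send each gene's counts only to its single highest-weight macrogene."""
    hard = np.zeros_like(W)
    hard[np.arange(W.shape[0]), W.argmax(axis=1)] = 1.0
    return X @ hard

# Toy inputs (assumptions for illustration only).
# X: 2 cells x 3 genes; W: 3 genes x 2 macrogenes, each row summing to 1.
X = np.array([[10.0, 0.0, 4.0],
              [0.0, 6.0, 2.0]])
W = np.array([[0.9, 0.1],
              [0.2, 0.8],
              [0.6, 0.4]])

weighted = weighted_macrogene_counts(X, W)
hard = hard_macrogene_counts(X, W)
```

Because each row of `W` sums to 1 in this toy setup, both variants preserve every cell's total count; they differ only in whether a gene with split weights (the third gene here) contributes to one macrogene or to several.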

Arcadia Science
Jun 2, 2023

Macrogenes capture orthology.

The SAMap approach is considered here to be the best method after SATURN. SAMap is based on a gene-grouping approach rather similar to SATURN's, but performs much worse. I wonder if the authors could comment, speculate, or experiment on the difference between protein sequence orthology and protein embedding similarity for this task. Is SAMap simply too restrictive in the number of genes that can be compared? Or would a multi-gene-weighted, autoencoder-enabled SAMap perform better than the published SAMap results, perhaps even comparably to macrogene SATURN? Finally, do the authors assign functional meaning to the weighted gene counts used by SATURN? Theoretically, macrogenes are quite sensitive to gene function, so the mapping of genes to macrogenes and functions should be of great interest.
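
One way to picture the question of restrictiveness is to compare a strict one-to-one ortholog table with links derived from embedding similarity. The sketch below is only an illustration under stated assumptions: the 2-D embeddings are toy vectors rather than protein language model outputs, the ortholog table and the threshold are hypothetical, and the one-to-one table is a deliberate simplification that does not represent SAMap's actual gene graph.

```python
import numpy as np

def cosine_similarity(A, B):
    """Pairwise cosine similarity between rows of A and rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def embedding_links(A, B, threshold):
    """Return (gene_a, gene_b) index pairs whose similarity meets the threshold."""
    sim = cosine_similarity(A, B)
    return [(int(i), int(j)) for i, j in zip(*np.where(sim >= threshold))]

# Toy embeddings (assumptions): 2 genes in species A, 3 genes in species B.
species_a = np.array([[1.0, 0.0],
                      [0.0, 1.0]])
species_b = np.array([[1.0, 0.1],
                      [0.9, 0.2],
                      [0.0, 1.0]])

# Hypothetical strict one-to-one ortholog table: gene index in A -> gene index in B.
one_to_one = {0: 0, 1: 2}

links = embedding_links(species_a, species_b, threshold=0.9)
```

In this toy case the second gene of species B has no entry in the one-to-one table but is still linked to the first gene of species A by embedding similarity, which is the kind of additional many-to-many comparison the comment asks about.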

Version published to 10.1101/2023.02.03.526939 on bioRxiv
Feb 3, 2023
