Toward universal cell embeddings: integrating single-cell RNA-seq datasets across species with SATURN

Abstract

Analysis of single-cell datasets generated from diverse organisms offers unprecedented opportunities to unravel fundamental evolutionary processes of conservation and diversification of cell types. However, interspecies genomic differences limit the joint analysis of cross-species datasets to homologous genes. Here we present SATURN, a deep learning method for learning universal cell embeddings that encodes genes’ biological properties using protein language models. By coupling protein embeddings from language models with RNA expression, SATURN integrates datasets profiled from different species regardless of their genomic similarity. SATURN can detect functionally related genes coexpressed across species, redefining differential expression for cross-species analysis. Applying SATURN to three species whole-organism atlases and frog and zebrafish embryogenesis datasets, we show that SATURN can effectively transfer annotations across species, even when they are evolutionarily remote. We also demonstrate that SATURN can be used to find potentially divergent gene functions between glaucoma-associated genes in humans and four other species.

Article activity feed

  1. Abstract

    1. This paper describes a new approach to analyzing single-cell RNA-seq data from multiple species. This has been a challenge because each species has a different repertoire of genes, and this paper seems to have solved the problem with a new approach to grouping genes across species based on embeddings from a protein language model. The idea is conceptually sound, and the results look quite good. The paper is quite clear and well-written. I have worked on this problem myself, and we have a forthcoming publication showing that protein structure space, while conceptually very similar to the embedding space used here, does not solve the task as well as SATURN. I have recommended this paper to my peers as an innovative approach, a useful method, and a clear demonstration of the power of protein language models to compare protein sequences.
  2. This indicates that mouse and lemur B cells are correctly clustered with human memory B cells, which is additionally confirmed by strong expression of Cd19. Thus, SATURN can be used to obtain fine-grained level annotations when cell atlases have been annotated with different granularity levels. Additionally, we found that SATURN correctly identified cell types specific to a single species within the integrated datasets. For instance, in muscle tissue, SATURN separated human epithelial and mesothelial cells from all other cell types (Supplementary Figure 1). These cell types are indeed absent in mouse and lemur datasets. In spleen, SATURN separated human erythrocytes (Supplementary Figure 3).

    I found these examples quite helpful. These are the kinds of behaviors one would expect from a properly functioning cross-species integration method.

  3. SATURN performs differential expression on macrogenes.

    The authors use model weights to generate count matrices in macrogene space. This has the advantage of mapping genes that don’t have one-to-one orthologs. How important is this approach to the success of SATURN? Would SATURN work just as well if genes were simply grouped with the most similar macrogene and read counts were not distributed across macrogenes? There may be benefit in further study of macrogenes made from weighted counts as a broader concept, but it is not clear that the autoencoder used to learn SATURN weights is critical to the method. I suspect the latent space learned by the protein language model is responsible for most of the apparent success of SATURN, and that simpler dimensionality reduction and embedding methods would be sufficient to generate species-aligned datasets. In any case, most researchers want to work in gene space, and SATURN crudely defaults to gene space for biological interpretation. In short, what are the benefits and drawbacks of a weighted macrogene space compared with one based on simple assignment of genes to parent macrogenes? Would the authors like to comment on the utility of macrogenes in analyzing protein evolution more generally? Or perhaps summarize key results from the emerging field of protein language embedding?

  4. Macrogenes capture orthology.

    The SAMap approach is considered here to be the best method after SATURN. SAMap is based on a rather similar gene-grouping approach to SATURN, but performs much worse. I wonder if the authors could comment, speculate, or experiment with the difference between protein sequence orthology and protein embedding similarity for this task. Is SAMap simply too restrictive in the number of genes that can be compared? Or would a multi-gene-weighted, autoencoder-enabled SAMap perform better than the published SAMap results, perhaps even comparably to macrogene SATURN? Finally, do the authors assign functional meaning to the weighted gene counts used by SATURN? Theoretically, macrogenes are quite sensitive to gene function, so the mapping of genes to macrogenes and functions should be of great interest.

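
The contrast raised in the comment on macrogene differential expression (counts distributed across macrogenes by weight, versus each gene assigned wholly to its most similar macrogene) can be made concrete with a small sketch. This is not SATURN's implementation: the function names `soft_macrogene_counts` and `hard_macrogene_counts`, the toy count matrix, and the weight values are all hypothetical, and the weights are assumed to be nonnegative with one row per gene and one column per macrogene.

```python
import numpy as np

def soft_macrogene_counts(counts, weights):
    """Spread each gene's counts over all macrogenes in proportion to its weights."""
    return counts @ weights

def hard_macrogene_counts(counts, weights):
    """Give each gene's counts entirely to its single highest-weight macrogene."""
    n_genes = weights.shape[0]
    one_hot = np.zeros_like(weights)
    one_hot[np.arange(n_genes), weights.argmax(axis=1)] = 1.0
    return counts @ one_hot

# Toy data (made up for illustration): 2 cells, 3 genes, 2 macrogenes.
counts = np.array([[4.0, 0.0, 2.0],
                   [1.0, 3.0, 0.0]])
# Hypothetical gene-to-macrogene weights; rows sum to 1 only by assumption.
weights = np.array([[0.9, 0.1],
                    [0.2, 0.8],
                    [0.5, 0.5]])

soft = soft_macrogene_counts(counts, weights)  # weighted macrogene space
hard = hard_macrogene_counts(counts, weights)  # simple assignment

print(soft)
print(hard)
```

In this toy case the third gene is split evenly between the two macrogenes under weighting, but under simple assignment it lands entirely in the first macrogene because `argmax` breaks ties by taking the first index. Genes with ambiguous membership are where the two schemes would be expected to differ most.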