Mapping single-cell atlases throughout Metazoa unravels cell type evolution

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    The development of single-cell genomic methods has transformed our understanding of cell types and their attributes across organisms. Here, Tarashansky et al. develop SAMap (Self-Assembling Manifold mapping), a graph-based data integration method which builds upon their previously described SAM algorithm to facilitate assignment of homologous genes and cell types across diverse species. As the authors show, this empowers comparative analyses across phyla to facilitate cellular annotation and examine the evolutionary origins of cellular diversity. Overall, the manuscript is well-written and the algorithm has the potential to be a foundation for comparative cellular atlasing.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

Abstract

Comparing single-cell transcriptomic atlases from diverse organisms can elucidate the origins of cellular diversity and assist the annotation of new cell atlases. Yet, comparison between distant relatives is hindered by complex gene histories and diversifications in expression programs. Previously, we introduced the self-assembling manifold (SAM) algorithm to robustly reconstruct manifolds from single-cell data (Tarashansky et al., 2019). Here, we build on SAM to map cell atlas manifolds across species. This new method, SAMap, identifies homologous cell types with shared expression programs across distant species within phyla, even in complex examples where homologous tissues emerge from distinct germ layers. SAMap also finds many genes with more similar expression to their paralogs than their orthologs, suggesting paralog substitution may be more common in evolution than previously appreciated. Lastly, comparing species across animal phyla, spanning sponge to mouse, reveals ancient contractile and stem cell families, which may have arisen early in animal evolution.

Article activity feed

  1. Author Response:

    Reviewer #1:

    This manuscript presents a generalizable tool for the comparison of single-cell atlases across species. The work addresses an important problem given the proliferation of such cataloguing efforts across a rapidly increasing diversity of organisms, and the opportunities this presents for comparative and evolutionary biology. The algorithms developed extend the use of self-assembling manifolds to this critical problem by addressing key challenges in the assignment of homologous genes and cell types. The method will be extremely useful for comparative studies to understand the evolutionary relationship of different cell types, and to quickly assign the cell type identity to new single-cell atlases by taking advantage of existing datasets. The authors demonstrate the robustness of the method by comparing cell atlases from diverse metazoans. In the process, the authors arrive at three provocative evolutionary conclusions that will require further investigation to fully support: widespread paralog substitutions, the multifunctionality of ancestral contractile cells, and the existence of a deeply conserved gene module associated with multipotency.

    Strengths:

    A key advantage of the approach presented is the relaxation of one-to-one mapping of orthologous genes, instead considering all possible homologous sequences in the alignment of the transcriptomes. Similarly, alignment of cell types is achieved by taking into account the general neighborhood of cell types and not just the closest match. The authors show that the algorithm outperforms existing methods, which were not really developed for the alignment of distantly related cell types. I expect this method will therefore be of general interest to anyone working with diverse organisms.

    Cell types inferred from the use of the algorithm could be validated in the poorly studied parasite Schistosoma mansoni. These experiments provide a glimpse into the broad utility of the analysis presented, which can be used as a resource in itself.

    We thank the reviewer for these positive comments.

    Weaknesses:

    The observation of widespread paralog substitution may be complicated by the use of relaxed gene orthology assignments in the initial alignment of cell types. It will be important to see whether similar levels of paralog substitution are observed when the paralogs in question are excluded during manifold assembly. This would ensure that the apparent paralog substitution is not a consequence of the necessary relaxation of ortholog assignments.

    We have performed the suggested analysis, with results summarized in the reply to the editor’s comments 2.1. and copied below.

    SAMap yields a similar combined manifold when using only one-to-one orthologs (Figure 2E), suggesting that at least for the zebrafish-frog comparison the paralogs are not driving the manifold mapping. To rule out the possibility that these paralogs were linked spuriously during the homology refinement steps of SAMap, we repeated the paralog substitution analysis on the combined manifold constructed using only one-to-one orthologs. This identified a largely similar set of paralog substitution events, although weaker manifold alignment when restricting the mapping to one-to-one orthologs led to the loss of some substitution paralogs that showed lower correlations. These new results are now reported in Figure 3 – figure supplement 1 and discussed in the text (lines 242-251).

    Further study of this phenomenon could reveal whether paralogs are more likely to be substituted in cases where they arose more recently, and whether the substitutions are stable within clades, perhaps elucidating different paths of specialization following the ancestral gene duplication event.

    To determine whether paralog substitution depends on how recently the paralogs arose, we used the orthology groups provided by eggNOG to infer when paralogs duplicated during evolution. We found that more recent paralogs substitute at higher rates than more ancestral paralogs, which is in line with the expectation that less diverged genes are likely more capable of functionally substituting for each other (Figure 3C). We also used the paralog substitution score to quantify the rate of paralog substitution in each cell type and observed that substituting paralogs are expressed in a wide variety of cell types, with some (e.g., dorsal organizer) exhibiting higher rates than others (Figure 3B), indicating uneven diversification rates of paralogs across cell types. Unfortunately, assessing the stability of paralog substitutions within a clade requires more cell atlases than are available at the moment. This analysis needs to densely sample species within clades and at key branching points along the tree of life. We now discuss these new results and possible future directions in the text (lines 229-231, lines 237-242, and lines 448-455).
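    The idea of a paralog substitution call, a gene whose expression across species correlates better with a paralog than with its ortholog, can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not SAMap's implementation: the function name `find_paralog_substitutions`, the `margin` cutoff, the gene names, and the correlation values are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple


def find_paralog_substitutions(
    ortholog_corr: Dict[str, float],
    paralog_corr: Dict[str, Dict[str, float]],
    margin: float = 0.3,  # hypothetical cutoff, not a value from the paper
) -> List[Tuple[str, str, float]]:
    """Flag genes whose best paralog correlates better than the ortholog.

    ortholog_corr: gene -> cross-species expression correlation with its ortholog.
    paralog_corr:  gene -> {paralog name -> cross-species expression correlation}.
    Returns (gene, best_paralog, score) tuples, highest score first, where
    score = best paralog correlation - ortholog correlation.
    """
    hits = []
    for gene, orth_c in ortholog_corr.items():
        candidates = paralog_corr.get(gene, {})
        if not candidates:
            continue  # no paralogs known for this gene
        best = max(candidates, key=candidates.get)
        score = candidates[best] - orth_c
        if score > margin:
            hits.append((gene, best, score))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Toy, made-up correlations for three hypothetical genes.
ortholog_corr = {"geneA": 0.10, "geneB": 0.80, "geneC": 0.05}
paralog_corr = {
    "geneA": {"geneA_par1": 0.75, "geneA_par2": 0.20},
    "geneB": {"geneB_par1": 0.60},
    "geneC": {"geneC_par1": 0.15},
}

substitutions = find_paralog_substitutions(ortholog_corr, paralog_corr)
for gene, paralog, score in substitutions:
    print(f"{gene}: paralog {paralog} outscores the ortholog by {score:.2f}")
```

    In this toy input only `geneA` is flagged, because its paralog is clearly better correlated than its ortholog; `geneB` keeps its ortholog, and `geneC` falls below the margin. Grouping such calls by the inferred duplication age of each paralog would give a rate per age class, in the spirit of the analysis described above.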

    The claim that ancestral contractile cells were multifunctional demands closer exploration of the gene module common to this cell type across species. Cellular contractility is a complex process in any cell and the distribution of the gene module across categories of signaling, actin regulation, and cell adhesion does not in itself imply multifunctionality.

    This comment has been addressed in the reply to editor’s comments 2.3., which is copied below.

    We apologize for this confusing statement. We have modified the text (lines 356-359) to clarify that ancestral contractile cells may already possess the broad assemblage of gene modules associated with different functional aspects of modern muscle cell types, including the adhesion complex that connects cells, actomyosin networks that drive contractility, and signaling pathways that stimulate contraction.

    The authors also point to a second enriched module within multipotent cells (stem cells) which could be investigated further. Cursory analysis suggests that the gene signature might simply be the consequence of actively dividing cells lacking specialized cell identity markers, as opposed to a more fundamental program of multipotency.

    Thanks for noting this potential point of confusion. We now provide three lines of evidence to show that stem cells are mapped through similarities beyond common features of dividing cells. First, though we did observe conservation of genes involved in cell cycle and DNA replication, they are not the most enriched gene categories (Figure 6C). Second, we have now performed new analysis to compare multipotent stem cells (MSCs), lineage-restricted stem cells, and differentiated cells for all four invertebrates analyzed in this study. We found that the conserved genes in MSCs consistently have lower expression in lineage-restricted stem cells, which also divide actively. This suggests that the gene expression program associated with MSCs is not shared by all dividing cells. Finally, this new analysis also identified several transcriptional regulators enriched in MSCs compared with other stem cells (Figure 6D). These genes include members of transcription factor families that are known to be essential in mammalian pluripotency (e.g., sox and klf) and chromatin modifiers that are not directly associated with the cell cycle but have reported functions in stem cell maintenance (e.g., kat7 and sub1). These new results are now discussed in lines 380-399.

    Reviewer #2:

    The authors sought to build upon their previous methods (self-assembling manifolds) to utilize these data representations to compare single cell atlases between organisms and compare cell types.

    Major strengths of the paper include:

    1. Benchmarking against state of the art integration methods
    2. Clever framework to relax the constraints on sequence orthology
    3. Many comparisons across diverse organisms

    The authors achieve their proposed aims and these tools may provide useful insight for the field going forward; however, it would be useful for the authors to highlight any potential limitations to the approach, places where comparisons did not work out well, etc.

    We thank the reviewer for this great suggestion. As detailed in the reply to editor’s comments 1.1-1.2, we have now performed new analysis and discussed potential limitations. These include the scalability to large datasets, the applicability to datasets collected across different pipelines, and the robustness to overfitting.

    Reviewer #3:

    The manuscript by Tarashansky et al. builds on this group's recently developed self-assembling manifold algorithm to develop methods for aligning cells of the same type across distantly related species using single cell gene expression data. The new method, SAMap, considers homologous genes in a novel way that takes into account paralog substitutions through gene expression correlations and the method further considers cell neighborhood relationships within and between species. Together, and through iterative analysis, these innovations maximally utilize the single cell data compared with only considering 1:1 orthologous genes and direct transcriptional correlations of cell types. Importantly (based on assumptions about cell type evolution), this method can identify homologous cell types based on shared neighbors, even if gene expression has diverged. The authors first apply SAMap to identify homologous cell types between developing zebrafish and xenopus at the whole organism level. SAMap captures nearly all homologous cell types, even with 1:1 orthologs using the mutual nearest neighbors approach whereas other top-in-field methods do poorly at this large evolutionary distance. SAMap also identifies 565 examples of candidate paralog substitution based on closer expression correlation of paralogs than orthologs. The authors further extend these comparisons to flatworms and trematodes, and then to further include sponge, Hydra, and mouse. One fascinating result is that Spongilla choanocytes and apopylar cells show homology to the neuronal family, supporting recent predictions.

    Overall, I find this approach extremely powerful and likely to be widely used in the study of cell type evolution and separately in the study of gene neofunctionalization. The validation among known homologs in distant vertebrates and benchmarking is convincing. My only major comment is that the authors could try a "leave one cluster out" analysis in the zebrafish-xenopus comparison to ensure that the method does not overfit when a homologous cell type is absent.

    Thanks for this great suggestion. We have performed the analysis and the results are summarized in the reply to editor’s comments 1.2. and copied below.

    To evaluate if SAMap overfits in cases where some cell types are missing, we performed dropout experiments in which we systematically removed each cell type that has an annotated homolog in the comparison of zebrafish and frog atlases. Cell types whose homologous partners were removed weakly mapped to closely related cell types, and most of these links were already present in the original analysis (Supplementary File 3). For example, optic cells from both species are also connected to eye primordium, frog skeletal muscles to zebrafish presomitic mesoderm, and frog hindbrain to zebrafish forebrain/midbrain. While we observed several mappings that were not present in the original analysis, their alignment scores were all barely above the detection threshold of SAMap. Moreover, most of these edges were mapped between cell types with similar developmental origins, with the only exception being the zebrafish neural crest mapped to the frog otic placode in the absence of frog neural crest cells. Examining the genes that support this mapping revealed that both cell types express sox9 and sox10, two TFs previously implicated to form a conserved gene regulatory circuit common to otic/neural crest cells (Betancur et al., 2011). These results are now discussed in the text (lines 194-210).
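    The structure of such a dropout experiment can be sketched in a few lines. This is only an illustration under stated assumptions: `best_match` is a hypothetical stand-in that scores cell types by the Pearson correlation of toy mean-expression profiles, whereas the analysis described above reruns SAMap itself; the cell type names, profiles, and the `threshold` value are made up.

```python
import math
from typing import Dict, List, Optional, Tuple

Profile = List[float]


def pearson(x: Profile, y: Profile) -> float:
    """Pearson correlation of two equal-length expression profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def best_match(
    profile: Profile, atlas: Dict[str, Profile], threshold: float
) -> Optional[Tuple[str, float]]:
    """Hypothetical stand-in for a mapping step: best-scoring type above threshold."""
    scored = [(name, pearson(profile, p)) for name, p in atlas.items()]
    scored = [s for s in scored if s[1] >= threshold]
    return max(scored, key=lambda s: s[1]) if scored else None


def leave_one_out(
    atlas_a: Dict[str, Profile],
    atlas_b: Dict[str, Profile],
    homologs: Dict[str, str],
    threshold: float = 0.5,  # hypothetical detection threshold
) -> Dict[str, Optional[Tuple[str, float]]]:
    """For each annotated homolog pair, drop the partner from atlas_b and remap."""
    results = {}
    for a_type, b_type in homologs.items():
        reduced = {k: v for k, v in atlas_b.items() if k != b_type}
        results[a_type] = best_match(atlas_a[a_type], reduced, threshold)
    return results


# Toy mean-expression profiles over four shared genes (made-up numbers).
atlas_a = {"typeA": [5, 4, 0, 0], "typeB": [0, 0, 5, 4]}
atlas_b = {"typeA": [5, 5, 0, 1], "typeA_like": [4, 3, 1, 0], "typeB": [0, 1, 5, 5]}
homologs = {"typeA": "typeA", "typeB": "typeB"}

results = leave_one_out(atlas_a, atlas_b, homologs)
for cell_type, match in results.items():
    print(cell_type, "->", match)
```

    In this toy setup, removing the partner of `typeA` sends it to the closely related `typeA_like`, while `typeB` is left unmapped once its partner is gone. A method that overfits would instead force `typeB` onto an unrelated type with a score above the threshold.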

    Minor comments:

    I am confused about how the homologous zebrafish and xenopus secretory cells with different developmental origins fit into the evolutionary cell type model. Could the foxa1 grhl cells that differ in their germ layer cells represent homology via horizontal transmission of a shared secretory gene network and convergent function rather than identity by descent and hierarchical diversification of a shared developmental gene regulatory network?

    We thank the reviewer for raising this important point. We now provide a deeper discussion about key transcription factors that are conserved between the secretory cell types (lines 166-175), as well as additional discussion regarding cell type homology and evolutionary convergence (lines 427-436). Specifically, we point out that the shared TFs are known to play important roles in specifying secretory cell types. For example, we have now identified a shared TF (klf17) between zebrafish and frog hatching glands, which arise from different germ layers. klf17 homologs have been shown to be crucial for the specification of the hatching glands in both zebrafish and frog (Kurauchi et al., 2010; Suzuki et al., 2019). The fact that these cell types share a number of TFs implicated in secretory cell type specification suggests they are evolutionary homologs, and did not evolve their functions convergently. This secretory cell type regulatory network has likely been redeployed (or co-opted) into different developmental lineages. Developmentally, this resembles convergence because different developmental lineages converge on similar identities. However, this is distinct from evolutionary convergence, because the secretory cell type regulatory network – composed of cell type-specific TFs and their downstream effector targets – evolved only once. Under evolutionary convergence, we would expect to observe different TFs driving secretory effector gene expression, reflecting the different cell type specification networks that converged on similar functions. However, fully resolving this evolutionary history will require further characterization of these networks in fish, frogs, and a broader array of vertebrates, which is outside the scope of this study. We hope our observations and discussion on this topic will stimulate research in this direction, and again thank the reviewer for raising this point.

    Are there any differences in the properties of genes that are deeply conserved in metazoan cell types (e.g., Fox, Csrp families in contractile cells) vs. genes that are more lineage restricted (e.g., mef2) - for example are the more conserved genes more central in regulatory networks within a species and thus more constrained?

    We agree with the reviewer that this is an important question. Genes that are deeply conserved throughout metazoan cell type families may be more central to the regulatory network compared to lineage-restricted genes. We now mention in the text (lines 371-373) that this is an important question to address in future studies.

    Why did heart, germline, and olfactory placode cells not cluster in the xenopus atlas - these seem like conserved populations, or was this due to sampling / staging?

    In the original analysis of the frog atlas, some cell clusters were isolated and subjected to a second round of sub-clustering. The final clustering assignments can therefore reflect very subtle differences that are not apparent when considering the entire dataset. As a result, the germline cells are scattered across the reconstructed manifold and do not concentrate in a distinct cluster. The heart cells and olfactory placode cells are inextricably mixed with larger populations of intermediate mesoderm and placodal cells, respectively. We have now clarified this potential point of confusion in the methods section (lines 635-642).

  2. Reviewer #3 (Public Review):

    The manuscript by Tarashansky et al. builds on this group's recently developed self-assembling manifold algorithm to develop methods for aligning cells of the same type across distantly related species using single cell gene expression data. The new method, SAMap, considers homologous genes in a novel way that takes into account paralog substitutions through gene expression correlations and the method further considers cell neighborhood relationships within and between species. Together, and through iterative analysis, these innovations maximally utilize the single cell data compared with only considering 1:1 orthologous genes and direct transcriptional correlations of cell types. Importantly (based on assumptions about cell type evolution), this method can identify homologous cell types based on shared neighbors, even if gene expression has diverged. The authors first apply SAMap to identify homologous cell types between developing zebrafish and xenopus at the whole organism level. SAMap captures nearly all homologous cell types, even with 1:1 orthologs using the mutual nearest neighbors approach whereas other top-in-field methods do poorly at this large evolutionary distance. SAMap also identifies 565 examples of candidate paralog substitution based on closer expression correlation of paralogs than orthologs. The authors further extend these comparisons to flatworms and trematodes, and then to further include sponge, Hydra, and mouse. One fascinating result is that Spongilla choanocytes and apopylar cells show homology to the neuronal family, supporting recent predictions.

    Overall, I find this approach extremely powerful and likely to be widely used in the study of cell type evolution and separately in the study of gene neofunctionalization. The validation among known homologs in distant vertebrates and benchmarking is convincing. My only major comment is that the authors could try a "leave one cluster out" analysis in the zebrafish-xenopus comparison to ensure that the method does not overfit when a homologous cell type is absent.

    Minor comments:

    I am confused about how the homologous zebrafish and xenopus secretory cells with different developmental origins fit into the evolutionary cell type model. Could the foxa1 grhl cells that differ in their germ layer cells represent homology via horizontal transmission of a shared secretory gene network and convergent function rather than identity by descent and hierarchical diversification of a shared developmental gene regulatory network?

    Are there any differences in the properties of genes that are deeply conserved in metazoan cell types (e.g., Fox, Csrp families in contractile cells) vs. genes that are more lineage restricted (e.g., mef2) - for example are the more conserved genes more central in regulatory networks within a species and thus more constrained?

    Why did heart, germline, and olfactory placode cells not cluster in the xenopus atlas - these seem like conserved populations, or was this due to sampling / staging?

  3. Reviewer #2 (Public Review):

    The authors sought to build upon their previous methods (self-assembling manifolds) to utilize these data representations to compare single cell atlases between organisms and compare cell types.

    Major strengths of the paper include:

    1. Benchmarking against state of the art integration methods

    2. Clever framework to relax the constraints on sequence orthology

    3. Many comparisons across diverse organisms

    The authors achieve their proposed aims and these tools may provide useful insight for the field going forward; however, it would be useful for the authors to highlight any potential limitations to the approach, places where comparisons did not work out well, etc.

  4. Reviewer #1 (Public Review):

    This manuscript presents a generalizable tool for the comparison of single-cell atlases across species. The work addresses an important problem given the proliferation of such cataloguing efforts across a rapidly increasing diversity of organisms, and the opportunities this presents for comparative and evolutionary biology. The algorithms developed extend the use of self-assembling manifolds to this critical problem by addressing key challenges in the assignment of homologous genes and cell types. The method will be extremely useful for comparative studies to understand the evolutionary relationship of different cell types, and to quickly assign the cell type identity to new single-cell atlases by taking advantage of existing datasets. The authors demonstrate the robustness of the method by comparing cell atlases from diverse metazoans. In the process, the authors arrive at three provocative evolutionary conclusions that will require further investigation to fully support: widespread paralog substitutions, the multifunctionality of ancestral contractile cells, and the existence of a deeply conserved gene module associated with multipotency.

    Strengths:

    A key advantage of the approach presented is the relaxation of one-to-one mapping of orthologous genes, instead considering all possible homologous sequences in the alignment of the transcriptomes. Similarly, alignment of cell types is achieved by taking into account the general neighborhood of cell types and not just the closest match. The authors show that the algorithm outperforms existing methods, which were not really developed for the alignment of distantly related cell types. I expect this method will therefore be of general interest to anyone working with diverse organisms.

    Cell types inferred from the use of the algorithm could be validated in the poorly studied parasite Schistosoma mansoni. These experiments provide a glimpse into the broad utility of the analysis presented, which can be used as a resource in itself.

    Weaknesses:

    The observation of widespread paralog substitution may be complicated by the use of relaxed gene orthology assignments in the initial alignment of cell types. It will be important to see whether similar levels of paralog substitution are observed when the paralogs in question are excluded during manifold assembly. This would ensure that the apparent paralog substitution is not a consequence of the necessary relaxation of ortholog assignments. Further study of this phenomenon could reveal whether paralogs are more likely to be substituted in cases where they arose more recently, and whether the substitutions are stable within clades, perhaps elucidating different paths of specialization following the ancestral gene duplication event.

    The claim that ancestral contractile cells were multifunctional demands closer exploration of the gene module common to this cell type across species. Cellular contractility is a complex process in any cell and the distribution of the gene module across categories of signaling, actin regulation, and cell adhesion does not in itself imply multifunctionality. The authors also point to a second enriched module within multipotent cells (stem cells) which could be investigated further. Cursory analysis suggests that the gene signature might simply be the consequence of actively dividing cells lacking specialized cell identity markers, as opposed to a more fundamental program of multipotency.
