Universal Cell Embeddings: A Foundation Model for Cell Biology


Abstract

Developing a universal representation of cells which encompasses the tremendous molecular diversity of cell types within the human body and more generally, across species, would be transformative for cell biology. Recent work using single-cell transcriptomic approaches to create molecular definitions of cell types in the form of cell atlases has provided the necessary data for such an endeavor. Here, we present the Universal Cell Embedding (UCE) foundation model. UCE was trained on a corpus of cell atlas data from human and other species in a completely self-supervised way without any data annotations. UCE offers a unified biological latent space that can represent any cell, regardless of tissue or species. This universal cell embedding captures important biological variation despite the presence of experimental noise across diverse datasets. An important aspect of UCE’s universality is that any new cell from any organism can be mapped to this embedding space with no additional data labeling, model training or fine-tuning. We applied UCE to create the Integrated Mega-scale Atlas, embedding 36 million cells, with more than 1,000 uniquely named cell types, from hundreds of experiments, dozens of tissues and eight species. We uncovered new insights about the organization of cell types and tissues within this universal cell embedding space, and leveraged it to infer function of newly discovered cell types. UCE’s embedding space exhibits emergent behavior, uncovering new biology that it was never explicitly trained for, such as identifying developmental lineages and embedding data from novel species not included in the training set. Overall, by enabling a universal representation for every cell state and type, UCE provides a valuable tool for analysis, annotation and hypothesis generation as the scale and diversity of single cell datasets continues to grow.

Article activity feed

  1. Start tokens are unique to each chromosome and species

    This feels confusing: if start tokens are unique to species, how is UCE able to generate embeddings for datasets from species it was not trained on?

  2. Every chromosome group is combined into a single sequence, with chromosome order randomly determined.

    It's surprising to me that chromosomes are randomly ordered; this feels a bit like the equivalent of randomly shuffling the clauses of a sentence. It would be helpful to explain this choice or discuss reasons why it might or might not be a concern.
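    To make the design concrete, here is a minimal sketch of the tokenization step as I understand it from the quote. Token names like `<CHR_START:...>` are illustrative placeholders, not UCE's actual vocabulary.

```python
import random

def build_cell_sequence(genes_by_chromosome, rng=random):
    """Sketch of per-cell sequence construction as described in the quote.

    genes_by_chromosome: dict mapping chromosome name -> list of gene tokens.
    Each chromosome group is wrapped in special tokens, and the groups are
    concatenated in random order (the design choice questioned above).
    """
    chromosomes = list(genes_by_chromosome)
    rng.shuffle(chromosomes)  # chromosome order randomly determined per cell
    sequence = []
    for chrom in chromosomes:
        sequence.append(f"<CHR_START:{chrom}>")  # hypothetical token names
        sequence.extend(genes_by_chromosome[chrom])
        sequence.append("<CHR_END>")
    return sequence

seq = build_cell_sequence({"chr1": ["GENE_A", "GENE_B"], "chr2": ["GENE_C"]})
```

    Note that gene order *within* a chromosome group is preserved; only the order of the groups themselves is shuffled, which is part of why the "shuffled clauses" analogy may or may not apply.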

  3. However, beyond that, the effect levels off (Supplementary Fig. 6). This is expected due to the curse of dimensionality in high-dimensional spaces and the variability in the level of ontological refinement in different branches of the ontology

    This feels awfully hand-wavy. I can understand that a leveling off is expected at some distance, but why at 5 hops in particular?

  4. For all three species we observed very high agreement between independent annotations of the novel species' data and the nearest cell type centroids in the IMA

    It would be helpful to mention here what these three species were and how distantly related they are to the eight species on which UCE was trained.

  5. We train a simple logistic classifier on the UCE embeddings of the Immune Cell Atlas [38], and then apply the classifier to B cell embeddings from Tabula Sapiens v2. This classifier accurately classifies the Tabula Sapiens v2 cells as memory and naive B cells

    This result feels hard to interpret without a comparison to other approaches or models. In other words, are embeddings from UCE uniquely able to capture the information required for this classification task?
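    For concreteness, the described transfer experiment has roughly this shape. This is a sketch with synthetic stand-ins for the UCE embeddings and labels (scikit-learn's `LogisticRegression` assumed; dimensions and dataset names are toy, not the real data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy stand-ins for UCE embeddings of annotated B cells from a reference
# atlas (8 dims for brevity; real UCE embeddings are much higher-dimensional).
memory = rng.normal(loc=1.0, size=(100, 8))
naive = rng.normal(loc=-1.0, size=(100, 8))
X_train = np.vstack([memory, naive])
y_train = ["memory"] * 100 + ["naive"] * 100

# Train on the reference embeddings...
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...then apply to embeddings from a second dataset (stand-in for the
# Tabula Sapiens v2 B cells in the quote).
X_new = rng.normal(loc=1.0, size=(10, 8))
pred = clf.predict(X_new)
```

    The comparison I'd want to see is exactly this pipeline repeated with embeddings from the baseline models (or log-normalized expression) in place of `X_train`/`X_new`.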

  6. UCE embeddings distinctly separate cell types more effectively than other methods tested in zero-shot

    This feels a bit subjective; I think this claim would be more convincing if it were grounded in a quantitative measure of clustering accuracy.
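    A quantitative grounding could be as simple as reporting adjusted Rand index against the annotated cell types and a silhouette score on the embeddings, for each method. A sketch of that check, with toy data standing in for the real embeddings and annotations:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, silhouette_score

rng = np.random.default_rng(1)

# Toy embeddings: three well-separated "cell types" (stand-ins only).
emb = np.vstack([rng.normal(c, 0.3, size=(50, 4)) for c in (0.0, 2.0, 4.0)])
labels = np.repeat([0, 1, 2], 50)  # stand-in for annotated cell types

# Cluster the embeddings, then score agreement with the annotations (ARI)
# and the geometric separation of the labeled types (silhouette).
clusters = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
ari = adjusted_rand_score(labels, clusters)
sil = silhouette_score(emb, labels)
```

    Reporting these per method would turn "distinctly separate" into a comparable number.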

  7. We compared several methods and found that UCE substantially outperforms the next best method, Geneformer, by 9.0% on overall score, 10.6% on biological conservation score, and 7.4% on batch correction score

    If possible, it would be helpful to contextualize these relative increases in performance, particularly given that none of the models listed in Supp Table 1 appear to significantly outperform the log-normalized raw data (the "overall score" is 0.74 for UCE and 0.72 for "log-normalized expression"). Without more context, it's hard to know what this means, whether it should be surprising, and whether it reflects limitations of the metrics or of the models.

    Also, I think it would be more transparent to mention here that there are two metrics for which UCE does not outperform other models (the ARI score and the "ASW (batch) score").

  8. Genes belonging to the same chromosome are grouped together by placing them in between special tokens and are then sorted by genomic location

    It would be helpful to understand the context and motivation for this design decision. In other words, what aspects of UCE's performance depend (or are suspected to depend) on including information about genomic position?
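    As I read it, this preprocessing amounts to bucketing genes by chromosome, sorting each bucket by genomic start position, and wrapping each bucket in special tokens. A minimal sketch (field and token names are my own, not UCE's):

```python
def group_and_sort(genes):
    """Sketch of the grouping described in the quote.

    genes: list of (gene_name, chromosome, start_position) tuples.
    Returns a token list with each chromosome's genes in genomic order,
    delimited by hypothetical special tokens.
    """
    by_chrom = {}
    for name, chrom, start in genes:
        by_chrom.setdefault(chrom, []).append((start, name))
    tokens = []
    for chrom, members in by_chrom.items():
        tokens.append(f"<CHR:{chrom}>")                      # illustrative token
        tokens.extend(name for _, name in sorted(members))   # genomic order
        tokens.append("</CHR>")
    return tokens

toks = group_and_sort([("B", "chr1", 200), ("A", "chr1", 100), ("C", "chr2", 50)])
```

    The open question remains which downstream behavior actually depends on this positional ordering, e.g. whether ablating the sort hurts performance.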

  9. This allows UCE to meaningfully represent any gene, from any species, regardless of whether the species had appeared in the training data

    It would be good to clarify here if "training data" refers to the data used to train the protein language model or UCE itself.