Capturing Context in Organismal Development with a Large Language Model
Abstract
Development, the process by which cells differentiate into specific fates to build the adult organism, is arguably the most context-dependent phenomenon in biology, shaped by constant interactions among neighboring cells and signals. Here, we investigate how a large language model (LLM) can leverage such contextuality to capture the dynamics of cellular differentiation. Using single-cell RNA activity data from developing zebrafish embryos, we train a domain-specific LLM, Zebraformer. We demonstrate that the gene and cell embeddings, numerical representations encoding biological meaning, produced by Zebraformer reflect the progression of zebrafish organismal formation, including the emergence of major anatomical domains and cell types. Furthermore, by using the model’s attention matrices, which capture how strongly genes influence each other, we observe patterns consistent with the developmental hourglass framework. This framework proposes that the middle phase of embryogenesis is more conserved across species than the early or late phases. At this phylotypic stage, the attention matrices reveal a rise in gene regulatory network complexity, reflected in increased node and edge counts and a narrowing-then-broadening network topology. This stage also shows greater sensitivity to genetic perturbations, as indicated by shifts in gene embeddings, pointing to heightened constraint consistent with the hourglass assumption. Our findings show that language models, when properly adapted, can offer powerful new ways to represent and understand the intricate choreography that enables cells to coordinate and build a living organism.
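As a rough illustration of how an attention matrix could be read as a gene regulatory network with node and edge counts, the sketch below thresholds a small gene-by-gene matrix and counts what remains. It is not the authors' pipeline: the function name `attention_to_network`, the threshold value, and the toy matrix are all hypothetical and chosen only for illustration.

```python
from typing import List, Set, Tuple


def attention_to_network(
    attention: List[List[float]], threshold: float
) -> Tuple[Set[int], Set[Tuple[int, int]]]:
    """Keep directed gene-to-gene edges whose attention weight exceeds the threshold.

    Hypothetical helper: row i, column j is treated as the attention that
    gene i pays to gene j. Self-attention (the diagonal) is ignored.
    """
    edges = {
        (i, j)
        for i, row in enumerate(attention)
        for j, weight in enumerate(row)
        if i != j and weight > threshold
    }
    # A gene counts as a node only if it takes part in at least one kept edge.
    nodes = {gene for edge in edges for gene in edge}
    return nodes, edges


# Toy 4-gene attention matrix with made-up values (each row sums to 1).
toy_attention = [
    [0.70, 0.20, 0.05, 0.05],
    [0.10, 0.60, 0.25, 0.05],
    [0.05, 0.05, 0.85, 0.05],
    [0.30, 0.05, 0.05, 0.60],
]

nodes, edges = attention_to_network(toy_attention, threshold=0.15)
print(len(nodes), len(edges))
```

Under this reading, comparing node and edge counts across developmental stages would amount to repeating the same thresholding on each stage's attention matrix; how the threshold is chosen would strongly affect the counts.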