Seq2KING: An unsupervised internal transformer representation of global human heritages
Abstract
Determining the intricate tapestry of human genetic relationships is a central challenge in population genetics and precision medicine. We propose that the principles of lexical connectivity, whereby words derive meaning from their contextual interactions, can be adapted to genetic data, enabling transformer models to reveal that individuals with higher genetic similarity form stronger latent connections. We explored this by transposing KING kinship matrices into the query-key-value (QKV) latent space of transformer models and determined that attention mechanisms can capture genetic relatedness in an unsupervised fashion. We found that individuals from the same continent had an attention-weight connectivity of 85.34% (p<0.05), compared with individuals from other continents. Surprisingly, we found that some encoder layers required inversion of their latent representations for this connectivity to become apparent. Lastly, we used BERTViz to render the hyper-dense connectivity patterns among individuals in human-readable form. Our approach is based purely on attention, which yields a non-discrete spectrum of relatedness and thus uncovers patterns from first principles. Seq2KING addresses the significant challenge of discovering population structure and constructing a global human relatedness map without relying on predefined labels. Our excavation of the latent space represents a paradigm shift from legacy supervised genetic methodologies, presenting a new way to understand the human pangenome and to discern population substructures for creating precision genetic medicines.
Non-Expert Description
Is it possible to build artificial intelligence (AI) that reads the human genome as a first language? Why would one want such AI? We at Ecotone believe that such AI will provide the genetic coordinates needed to manufacture CRISPR medications to cure ∼10,000 genetic diseases. How does one build such AI? Our recently released model dnaSORA proposed a means to assign meaning to every single token (typically referred to as a base) among all 3 billion tokens in the human genome (Koreniuk & Njie, 2025). This builds the vocabulary for reading the human genome as a first language. For dnaSORA to work, it needs to know the heritages of the people represented in its model of our genetics. We mostly rely on country, culture, and geography to determine our heritages, but this is too error-prone for dnaSORA. Also error-prone, in our experience, are legacy genetic approaches such as those used by 23andMe.
Our research here introduces Seq2KING, a new artificial intelligence method based on excavating the insides of transformers to uncover hidden patterns of genetic relatedness among people around the world, without needing any prior labels or categorizations. The key innovation of Seq2KING is applying the principles of lexical connectivity, the idea that words derive meaning through their relationships to other words, to genetic data. Just as “dog” gains meaning through its connections to words like “pet,” “animal,” and “loyal,” we show that individuals’ genomes can be understood through their genetic connections to others. We start by converting raw genetic data into a compact kinship matrix (using a tool called KING) that summarizes how closely everyone is related. We then feed these kinship values into a transformer model, the same kind of AI behind cutting-edge language tools like ChatGPT.
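The core idea above can be sketched in a few lines: treat each individual's row of the KING kinship matrix as a token embedding and pass it through one head of scaled dot-product self-attention. This is a minimal illustrative sketch, not the paper's actual architecture; the kinship values, projection matrices, and latent dimension below are all invented for demonstration.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy symmetric kinship matrix for 4 individuals (diagonal = self-kinship, ~0.5).
# In the paper's pipeline these values would come from KING; here they are invented.
K = np.array([
    [0.50, 0.20, 0.01, 0.02],
    [0.20, 0.50, 0.02, 0.01],
    [0.01, 0.02, 0.50, 0.25],
    [0.02, 0.01, 0.25, 0.50],
])

rng = np.random.default_rng(0)
d = 8  # latent dimension (assumed)
W_q, W_k, W_v = (rng.normal(size=(K.shape[1], d)) for _ in range(3))

# Treat each individual's kinship row as a token embedding and apply
# one head of scaled dot-product self-attention.
Q, Kp, V = K @ W_q, K @ W_k, K @ W_v
A = softmax(Q @ Kp.T / np.sqrt(d))  # (4, 4) individual-to-individual attention weights
out = A @ V                         # updated latent representation per individual
```

Each row of `A` is a probability distribution over all other individuals, which is what makes the learned relatedness a continuous spectrum rather than a set of discrete category assignments.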
Inside the transformer, special components called “attention heads” learn which individuals are most similar, strengthening links between people from the same region and revealing subtler connections across continents. Unlike legacy approaches that rely on discrete, pre-defined categories, Seq2KING provides continuous measures of relatedness, allowing us to visualize connections between any individual and all other humans. Additionally, because Seq2KING operates directly within the transformer’s internal reasoning system, it can be seamlessly integrated as a component within larger genome interpretation systems, essentially functioning like high-speed cache memory for heritage assignments and dramatically improving both efficiency and scalability. By examining these attention patterns, we can reconstruct familiar population groupings, such as European, African, and Asian heritage, entirely from the model’s internal logic. Finally, we use a visualization technique (BERTViz) to turn these dense connection maps into intuitive diagrams that highlight population connections between individuals.
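One way to quantify the within-region versus cross-region pattern described above is to sum attention mass over same-continent and different-continent pairs. The sketch below assumes a small attention matrix already extracted from an encoder layer; the matrix values, the continent labels, and this particular connectivity metric are illustrative assumptions, not the paper's exact computation.

```python
import numpy as np

# Toy attention matrix (rows sum to 1), standing in for weights extracted from
# one encoder layer; the continent labels are hypothetical.
attn = np.array([
    [0.40, 0.35, 0.15, 0.10],
    [0.30, 0.45, 0.10, 0.15],
    [0.10, 0.10, 0.45, 0.35],
    [0.15, 0.05, 0.30, 0.50],
])
labels = np.array(["AFR", "AFR", "EUR", "EUR"])

same = labels[:, None] == labels[None, :]
off_diag = ~np.eye(len(labels), dtype=bool)
within = attn[same & off_diag].sum()  # attention mass within a continent (self excluded)
between = attn[~same].sum()           # cross-continent pairs are always off-diagonal
within_pct = 100 * within / (within + between)
print(f"within-continent attention share: {within_pct:.2f}%")
```

A trained model in which same-continent individuals attend strongly to one another would drive `within_pct` well above the cross-continent share, which is the kind of summary behind the paper's reported 85.34% figure.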
Because our approach doesn’t rely on pre-assigned labels, it offers a truly unbiased way to explore human population structure. This could help scientists trace migration routes that resulted in the peopling of the continents, find subtle subgroups within larger populations, and remove “background noise” in genetic studies of disease. Ultimately, Seq2KING paves the way for more precise genetic maps of all humans, revealing the natural “family trees” hidden in our DNA and bringing us one step closer to reading the human genome as a first language.