Learning gene interactions and functional landscapes from entire bacterial proteomes

Abstract

Unraveling complex gene interactions in bacterial genomes and understanding their functions will enable critical advances in fields including bacterial genome evolution, microbiome studies, and drug and natural product discovery. This is a challenging problem because of the structural and functional complexity of bacterial genomes and because of issues such as poor gene annotation in non-model species. Language models (LMs) provide a feasible framework for learning the complex interactions among genes from the large number of publicly available, unannotated bacterial genomes. However, applications of language models have mostly been limited to models trained on short genomic sequences. Here, we introduce what is, to our knowledge, the first foundation model for whole bacterial proteomes. Our model was trained on ESM embeddings of tens of thousands of full-size proteomes and can generate contextual embeddings for individual proteins as well as embeddings representing the entire genome. We show that our model captures gene-gene interactions and genomic integrity. We further demonstrate that the learned embeddings can be used to achieve state-of-the-art performance on downstream tasks such as identifying operons and predicting genotype-phenotype maps.
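
To make the distinction between per-protein contextual embeddings and a whole-genome embedding concrete, the following is a minimal illustrative sketch, not the authors' architecture. It assumes each protein is already represented by a fixed-length vector (random numbers stand in for ESM embeddings here), applies a single parameter-free self-attention pass so that each protein's vector depends on the other proteins in the same proteome, and mean-pools the result into one vector per genome. The function names `contextualize` and `genome_embedding`, the toy dimensions, and the choice of mean pooling are all assumptions made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(protein_embs):
    """One parameter-free self-attention pass (hypothetical helper).

    Each output vector is an attention-weighted average of all protein
    vectors in the proteome, so it reflects genomic context.
    """
    dim = len(protein_embs[0])
    scale = math.sqrt(dim)
    contextual = []
    for query in protein_embs:
        # Scaled dot-product similarity between this protein and every protein.
        scores = [sum(q * k for q, k in zip(query, key)) / scale
                  for key in protein_embs]
        weights = softmax(scores)
        contextual.append([
            sum(w * vec[d] for w, vec in zip(weights, protein_embs))
            for d in range(dim)
        ])
    return contextual


def genome_embedding(contextual_embs):
    """Mean-pool contextual protein vectors into one genome-level vector.

    Mean pooling is an assumption; the source does not state the pooling used.
    """
    n = len(contextual_embs)
    dim = len(contextual_embs[0])
    return [sum(vec[d] for vec in contextual_embs) / n for d in range(dim)]


# Toy proteome: 5 proteins, 8-dimensional stand-ins for ESM embeddings.
random.seed(0)
toy_proteome = [[random.gauss(0.0, 1.0) for _ in range(8)] for _ in range(5)]

contextual = contextualize(toy_proteome)
genome_vec = genome_embedding(contextual)
```

In this sketch `contextual` has one vector per protein and `genome_vec` is a single vector for the whole proteome. A trained model would use learned projections and many stacked layers rather than one unparameterized pass.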
