Deciphering enzymatic potential in metagenomic reads through DNA language models
Abstract
Microbial communities drive essential global processes, yet much of their functional potential remains unexplored. Metagenomics stands to elucidate this microbial “dark matter” by directly sequencing the microbial community DNA from environmental samples. However, the exploration of metagenomic sequences is mostly limited to establishing their similarity to curated reference sequences. Language model (LM)-based methods represent a paradigm shift, offering promising avenues for reference-free analysis of metagenomic reads. Here, we introduce two LMs: REMME, a pretrained foundation model aimed at understanding the DNA context of metagenomic reads, and REBEAN, a fine-tuned model for predicting the enzymatic potential encoded within the read-corresponding genes. By emphasizing function recognition over gene identification, REBEAN labels gene-encoded molecular functions of both previously explored and new (orphan) sequences. Even though it was not trained to do so, REBEAN identifies the function-relevant parts of a gene. It thus expands enzymatic annotation of unassembled metagenomic reads. Finally, we present novel enzymes discovered using our models, highlighting the models' impact on our understanding of microbial communities.