Deciphering enzymatic potential in metagenomic reads through DNA language model

Abstract

Earth’s microbial world plays a fundamental role in shaping the biosphere: steering global processes, rejuvenating soil, and fortifying ecosystems. An overwhelming majority of microbial entities, however, remain un(der)studied. Metagenomics stands to elucidate this microbial “dark matter” by directly sequencing microbial community DNA from environmental samples. Yet our ability to explore metagenomic sequences is mostly limited to establishing their similarity to curated datasets of organisms or genes/proteins. Aside from the difficulties in establishing such similarity, reference-based approaches, by definition, forgo discovery of any entities sufficiently unlike the reference collection.

Language model-based methods present a paradigm shift, offering promising avenues for reference-free analysis of metagenomic reads. Here, we introduce two language models: REMME, a pretrained foundation model aimed at understanding the DNA context of metagenomic reads, and REBEAN, a fine-tuned model for predicting the enzymatic potential encoded within the read-corresponding genes. By emphasizing function recognition over gene identification, REBEAN is able to label molecular functions carried both by previously explored genes and by new (orphan) sequences. REBEAN also identifies the functionally relevant parts of a gene even though it is not explicitly trained to do so. It thus expands enzymatic annotation of unassembled metagenomic reads from extreme environments while maintaining consistency with available annotation methods. Our comprehensive analysis highlights the models’ potential for annotating metagenomic reads and unearthing novel enzymes, thus enriching our understanding of microbial communities.