Deciphering enzymatic potential in metagenomic reads through DNA language model
Abstract
The microbial world plays a fundamental role in shaping Earth’s biosphere, steering global processes such as carbon and nitrogen cycling, soil rejuvenation, and ecological fortification. An overwhelming majority of microbial entities, however, remain unstudied. Metagenomics stands to elucidate this microbial “dark matter” by directly sequencing the microbial community DNA from environmental samples. Yet, our ability to explore these metagenomic sequences is limited to establishing their similarity to curated datasets of organisms or genes/proteins. Aside from the difficulties in establishing such similarity, the reference-based approaches, by definition, forgo discovery of any entities sufficiently unlike the reference collection.
Presenting a paradigm shift, language model-based methods offer promising avenues for reference-free analysis of metagenomic reads. Here, we introduce two language models: the pretrained foundation model REMME, aimed at understanding the DNA context of metagenomic reads, and the fine-tuned REBEAN model for predicting the enzymatic potential encoded within the read-corresponding genes. By emphasizing function over gene identification, REBEAN is able to label known functions carried both by previously explored genes and by new (orphan) sequences. Furthermore, even though it is not explicitly trained to do so, REBEAN identifies the functionally relevant parts of a gene. Our comprehensive analysis highlights our models’ potential for metagenomic read annotation and unearthing of novel enzymes, thus enriching our understanding of microbial communities.