MetagenBERT: a Transformer Architecture using Foundational DNA Read Embedding Models to enhance Disease Classification
Abstract
Microbial ecosystems constitute complex yet information-rich environments whose characterization is crucial for understanding host health and disease. Among them, the human gut microbiome has emerged as a key "super-integrator", owing to its dense interactions with host physiology and its established associations with a wide spectrum of pathologies. Driven by advances in high-throughput sequencing technologies and the continuous decline in associated costs, metagenomic studies have expanded exponentially, generating massive amounts of sequencing data and opening new avenues for data-driven disease modeling. Conventional approaches to microbiome analysis predominantly rely on aligning DNA sequencing reads against reference databases to infer microbial composition and produce species-level profiles. While effective, these methods are inherently constrained by reference bias and limited taxonomic resolution. Recent advances in artificial intelligence, particularly in Natural Language Processing (NLP), offer new methodological perspectives for metagenomic data representation. In this study, we present MetagenBERT, a Transformer-based framework for embedding metagenomes that relies on the foundational models DNABERT-2 and DNABERT-S to embed DNA sequencing reads. Our approach encodes gut microbiome metagenomes in a taxonomy-agnostic manner, enabling direct downstream application to disease classification tasks. We demonstrate that MetagenBERT performs comparably to state-of-the-art abundance-based models for cirrhosis prediction and surpasses them in the more challenging context of type 2 diabetes. Furthermore, we introduce an alternative representation of metagenomes based on read-level embeddings aggregated into abundance vectors, and demonstrate that this representation is complementary to conventional species-level abundance metrics.
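The abstract does not specify how read-level embeddings are aggregated into abundance vectors. As a purely illustrative sketch of one way this could work, the example below assumes that each read embedding is assigned to its nearest cluster centroid and that the abundance vector is the fraction of reads per cluster. The function names, the nearest-centroid rule, and the toy two-dimensional data are all assumptions made for illustration, not the authors' method; in practice the embeddings would come from a read encoder such as DNABERT-2 or DNABERT-S.

```python
from collections import Counter
import math


def nearest_centroid(embedding, centroids):
    """Return the index of the centroid closest (Euclidean) to `embedding`."""
    return min(
        range(len(centroids)),
        key=lambda k: math.dist(embedding, centroids[k]),
    )


def embedding_abundance_vector(read_embeddings, centroids):
    """Aggregate read-level embeddings into a relative-abundance vector.

    Each read is assigned to its nearest centroid; entry k of the result is
    the fraction of reads assigned to cluster k, so the entries sum to 1.
    """
    if not read_embeddings:
        raise ValueError("need at least one read embedding")
    counts = Counter(nearest_centroid(e, centroids) for e in read_embeddings)
    total = len(read_embeddings)
    return [counts.get(k, 0) / total for k in range(len(centroids))]


# Hypothetical toy data (assumption): 2-D stand-ins for read embeddings and
# for centroids that would be learned by clustering embeddings across samples.
centroids = [[0.0, 0.0], [5.0, 5.0], [-5.0, 5.0]]
reads = [[0.1, -0.2], [0.3, 0.1], [4.8, 5.2], [-5.1, 4.7]]

abundance = embedding_abundance_vector(reads, centroids)
print(abundance)
```

A vector of this kind has one entry per embedding cluster rather than per species, which is what would allow it to be concatenated with, or compared against, a conventional species-level abundance profile for the same sample.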