MGM as a large-scale pretrained foundation model for microbiome analyses in diverse contexts

Abstract

Microbial communities significantly impact medicine, biotechnology, and agriculture. Advanced sequencing technologies have generated extensive microbiome data, enabling the discovery of substantial evolutionary and ecological patterns. However, traditional supervised learning methods struggle to capture universal patterns in microbial community data, largely because of the high heterogeneity of the data and profound batch effects among samples. This makes it difficult to classify samples and detect biomarkers across millions of samples, let alone to resolve the intricate but important dynamic patterns that arise in different contexts. In this study, we introduce the Microbial General Model (MGM), the first microbiome community foundation model, pre-trained on a dataset of 263,302 microbiome samples using language modeling techniques. MGM demonstrated significant improvements in microbial community classification compared to traditional machine learning methods. Additionally, MGM enabled contextualized classification and effectively overcame cross-regional limitations, showing enhanced performance on intercontinental datasets through transfer learning. Furthermore, fine-tuning MGM on a longitudinal infant dataset revealed distinct keystone genera during development, with Bacteroides and Bifidobacterium exhibiting higher attention weights in vaginal deliveries, and Haemophilus in cesarean deliveries. Finally, through in silico modeling, the model also uncovered novel microbial dynamic patterns in a Crohn’s disease cohort following antibiotic treatment. In conclusion, by leveraging self-attention and autoregressive pre-training, MGM serves as a versatile model for various downstream microbiome tasks and holds significant potential for context-specific analyses.

Key points

  • The Microbial General Model (MGM) is a foundation model with millions of parameters, pre-trained on 263,302 microbial community samples.

  • MGM outperforms traditional machine learning methods on microbiome classification and prediction tasks, such as microbial community classification.

  • MGM effectively captures the spatial and temporal dynamics of microbial communities.

  • MGM can detect the effects of perturbations on microbial communities through in silico experiments.
