Self-supervised learning enables robust microbiome predictions in data-limited and cross-cohort settings

Abstract

The gut microbiome plays a crucial role in human health, but machine learning applications in this field face significant challenges, including limited availability of labeled data, high dimensionality, and batch effects across cohorts. To address these limitations, we developed representation learning models for gut microbiome metagenomic data, drawing inspiration from foundation model approaches built on self-supervised and transfer learning principles. Leveraging a large collection of 85,364 metagenomic samples, we implemented multiple self-supervised learning methods, including masked autoencoders with varying masking rates and adapted single-cell RNA sequencing models (scVI and scGPT), to generate embeddings from bacterial abundance profiles. These learned representations demonstrated significant advantages over raw bacterial abundances in two key scenarios: first, when training predictive models with very limited labeled data, improving prediction performance for age (r = 0.14 vs. 0.06), BMI (r = 0.16 vs. 0.11), visceral fat mass (r = 0.25 vs. 0.18), and drug usage classification (PR-AUC = 0.81 vs. 0.73); and second, when generalizing predictions across cohorts, consistently outperforming models based on raw abundances in cross-dataset evaluation. Our approach provides a valuable framework for leveraging self-supervised representation learning to overcome the data limitations inherent in microbiome research, potentially enabling more robust and generalizable machine learning applications in this field.
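The masked-autoencoder idea described above can be illustrated with a minimal sketch: randomly hide a fraction of the taxa in each abundance profile, train a small encoder-decoder to reconstruct the hidden values, and keep the encoder's hidden activations as the sample embedding. This toy NumPy version is an assumption-laden illustration, not the paper's implementation — the data are synthetic Dirichlet "abundance profiles," and the network sizes, masking rate, and learning rate are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples x 50 "taxa" relative abundances
# (rows sum to 1, mimicking compositional microbiome profiles).
X = rng.dirichlet(np.ones(50), size=200)

n_features, n_hidden = X.shape[1], 16
W_enc = rng.normal(0, 0.1, (n_features, n_hidden))
W_dec = rng.normal(0, 0.1, (n_hidden, n_features))
lr, mask_rate = 0.5, 0.3  # illustrative values, not from the paper

losses = []
for epoch in range(200):
    mask = rng.random(X.shape) < mask_rate   # True = taxon hidden from encoder
    X_in = np.where(mask, 0.0, X)            # zero out the masked taxa
    H = np.tanh(X_in @ W_enc)                # sample embedding
    X_hat = H @ W_dec                        # reconstruction of all taxa
    err = (X_hat - X) * mask                 # MSE computed only on masked entries
    losses.append(float(np.mean(err ** 2)))
    # Manual backprop through the two layers (gradient descent on masked MSE)
    grad_dec = H.T @ err / X.shape[0]
    grad_H = (err @ W_dec.T) * (1.0 - H ** 2)
    grad_enc = X_in.T @ grad_H / X.shape[0]
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# After training, embed samples without masking; these low-dimensional
# vectors would replace raw abundances as inputs to downstream predictors.
embedding = np.tanh(X @ W_enc)
```

Downstream, a supervised model (e.g. for age or BMI) would be fit on `embedding` rather than on `X`, which is where the label-efficiency and cross-cohort gains reported in the abstract arise.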
