Self-supervised learning enables robust microbiome predictions in data-limited and cross-cohort settings

Abstract

The gut microbiome plays a crucial role in human health, but machine learning applications in this field face significant challenges, including limited availability of labeled data, high dimensionality, and batch effects across cohorts. To address these limitations, we developed representation learning models for gut microbiome metagenomic data, drawing inspiration from foundation model approaches built on self-supervised and transfer learning principles. Leveraging a large collection of 85,364 metagenomic samples, we implemented multiple self-supervised learning methods, including masked autoencoders with varying masking rates and adapted single-cell RNA sequencing models (scVI and scGPT), to generate embeddings from bacterial abundance profiles. These learned representations demonstrated significant advantages over raw bacterial abundances in two key scenarios: first, when training predictive models with very limited labeled data, improving prediction performance for age (r = 0.14 vs. 0.06), BMI (r = 0.16 vs. 0.11), visceral fat mass (r = 0.25 vs. 0.18), and drug usage classification (PR-AUC = 0.81 vs. 0.73); and second, when generalizing predictions across cohorts, consistently outperforming models based on raw abundances in cross-dataset evaluation. Our approach provides a valuable framework for leveraging self-supervised representation learning to overcome the data limitations inherent in microbiome research, potentially enabling more robust and generalizable machine learning applications in this field.
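The masked-autoencoder idea described above can be illustrated with a minimal sketch: randomly hide a fraction of the taxa in each abundance profile, train a small encoder-decoder to reconstruct the hidden values, and keep the encoder's hidden activations as the sample embedding. This toy NumPy version is an assumption-laden illustration, not the paper's implementation — the data are synthetic Dirichlet "abundance profiles," and the network sizes, masking rate, and learning rate are arbitrary choices for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples x 50 "taxa" relative abundances
# (rows sum to 1, mimicking compositional microbiome profiles).
X = rng.dirichlet(np.ones(50), size=200)

n_features, n_hidden = X.shape[1], 16
W_enc = rng.normal(0, 0.1, (n_features, n_hidden))
W_dec = rng.normal(0, 0.1, (n_hidden, n_features))
lr, mask_rate = 0.5, 0.3  # illustrative values, not from the paper

losses = []
for epoch in range(200):
    mask = rng.random(X.shape) < mask_rate   # True = taxon hidden from encoder
    X_in = np.where(mask, 0.0, X)            # zero out the masked taxa
    H = np.tanh(X_in @ W_enc)                # sample embedding
    X_hat = H @ W_dec                        # reconstruction of all taxa
    err = (X_hat - X) * mask                 # MSE computed only on masked entries
    losses.append(float(np.mean(err ** 2)))
    # Manual backprop through the two layers (gradient descent on masked MSE)
    grad_dec = H.T @ err / X.shape[0]
    grad_H = (err @ W_dec.T) * (1.0 - H ** 2)
    grad_enc = X_in.T @ grad_H / X.shape[0]
    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

# After training, embed samples without masking; these low-dimensional
# vectors would replace raw abundances as inputs to downstream predictors.
embedding = np.tanh(X @ W_enc)
```

Downstream, a supervised model (e.g. for age or BMI) would be fit on `embedding` rather than on `X`, which is where the label-efficiency and cross-cohort gains reported in the abstract arise.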
