The Second Brain: Diffusion Models for Realistic Human Microbiome Generation
Abstract
The human microbiome is a critical determinant of health and disease, but microbiome machine learning is constrained by limited data availability, heterogeneous cohort coverage, and privacy risks from individually identifying microbial signatures. Synthetic microbiome generation could support method development and privacy-preserving sharing, provided that generated samples preserve the ecological zero-inflation of real communities. We present a diffusion-based generative model with a sparsity-preserving decoder built around two sparsity-focused mechanisms: (1) prevalence-aware bias initialization that anchors per-taxon presence probabilities to observed prevalences from epoch one; and (2) a hard sparsity loss implemented with straight-through gradient estimators. The implementation also uses hyperbolic taxonomic embeddings as an unvalidated, phylogeny-aware architectural prior in the diffusion backbone. Evaluated on the American Gut Project (AGP; 4,827 samples, 500 taxa), the full 15.2M-parameter model achieves parametric-level sparsity preservation: 1.4% deviation in the main comparison and 2.6% ± 0.5% deviation across three AGP seeds. SparseDOSSA2 achieves the lowest sparsity deviation in this comparison (0.7%), and MIDASim also passes the operational sparsity threshold (4.9%). Among the three threshold-passing methods, MIDASim achieves the best ecological distance scores, SparseDOSSA2 is best on sparsity deviation, and our model achieves the best prevalence correlation (0.996) while narrowly improving on SparseDOSSA2 on Bray–Curtis (0.0485 vs. 0.0495) and UniFrac (0.0400 vs. 0.0435) discrepancies. PERMANOVA remains able to distinguish generated from real AGP samples (F = 64.29), which we treat as an important limitation rather than evidence of indistinguishability.
These results support a deliberately narrow conclusion: this is, to our knowledge, the first deep generative model to match parametric-level sparsity preservation for human microbiome profiles while remaining competitive on standard ecological distance metrics.
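As an illustration only, the sketch below shows one common way to realize a prevalence-aware bias initialization and to measure sparsity deviation. It is not taken from the authors' implementation. It assumes that the presence head is a sigmoid gate whose bias can be set to the logit of each taxon's observed prevalence, and that sparsity deviation is the absolute difference in the fraction of zero entries between real and generated tables. The abstract does not specify either detail, and the function names, clipping constant, and toy data are hypothetical.

```python
import math

def prevalence_bias(prevalences, eps=1e-4):
    """Return logit(p) per taxon, so that sigmoid(bias) matches the prevalence.

    Assumption: the decoder's presence gate is a sigmoid; eps is a
    hypothetical clipping constant that keeps the logit finite.
    """
    biases = []
    for p in prevalences:
        p = min(max(p, eps), 1.0 - eps)  # clip away exact 0 and 1
        biases.append(math.log(p / (1.0 - p)))
    return biases

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def zero_fraction(samples):
    """Fraction of entries that are exactly zero in a samples-by-taxa table."""
    total = sum(len(row) for row in samples)
    zeros = sum(1 for row in samples for value in row if value == 0)
    return zeros / total

def sparsity_deviation(real, generated):
    """Assumed definition: absolute difference in zero fraction."""
    return abs(zero_fraction(generated) - zero_fraction(real))

# Toy relative-abundance tables (3 samples, 4 taxa); not real AGP data.
real = [
    [0.0, 0.5, 0.5, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
]
generated = [
    [0.0, 0.5, 0.5, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.2, 0.8, 0.0, 0.0],
]

# Observed prevalence: share of real samples in which each taxon is present.
n_taxa = len(real[0])
prevalences = [
    sum(1 for row in real if row[j] > 0) / len(real) for j in range(n_taxa)
]
biases = prevalence_bias(prevalences)
initial_presence = [sigmoid(b) for b in biases]
deviation = sparsity_deviation(real, generated)
```

Under these assumptions, the gate's initial presence probability for each taxon equals its clipped observed prevalence before any training, which is the sense in which such an initialization anchors presence probabilities from the first epoch.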