Learning the DNA syntax of human microbiomes to infer health and disease
Abstract
The human microbiome is a key factor in human health, and alterations in its community structure are associated with diverse pathological conditions. However, defining universal criteria to distinguish healthy from altered microbiome configurations remains challenging because of inter- and intra-individual variability, the database dependence of existing approaches, and the complexity of analyzing numerous microbial features simultaneously. Here, we developed an approach that learns the syntax of the entire DNA of human microbial communities using Sequence-Informed GC-normalized 4-mers (SIG-mers), which feed into statistical and machine learning frameworks. We identified distinct SIG-mer signatures that differentiate the microbiomes of body sites across diverse healthy human populations. These signatures reveal both global microbiome shifts and individual-specific dynamics in response to antibiotic treatments and in chronic inflammatory disease. Leveraging machine learning models, we inferred health- and disease-associated microbiome states from SIG-mer profiles, capturing the degree of perturbation and disease severity. Our findings highlight SIG-mer profiling as a robust, unbiased, and broadly applicable approach for personalized microbiome diagnostics and for guiding therapeutic interventions.
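The abstract does not specify how SIG-mers are computed, so the following is only an illustrative sketch of one plausible way to build a GC-normalized 4-mer profile. It assumes that normalization means dividing each observed 4-mer frequency by the frequency expected from the sequence's overall GC content; the function name `gc_normalized_4mers` and this particular normalization are hypothetical and may differ from the authors' actual method.

```python
from collections import Counter
from itertools import product


def gc_normalized_4mers(seq):
    """Observed/expected ratio for all 256 4-mers (hypothetical normalization).

    The expected frequency comes from a simple base-composition model in which
    G and C each have probability gc/2 and A and T each have (1 - gc)/2.
    This is an assumption for illustration, not the published SIG-mer definition.
    """
    k = 4
    seq = seq.upper()
    alphabet = set("ACGT")

    # Count overlapping 4-mers, skipping windows with ambiguous bases (e.g. N).
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= alphabet:
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return {}

    # Overall GC fraction over unambiguous bases.
    bases = [b for b in seq if b in alphabet]
    gc = sum(b in "GC" for b in bases) / len(bases)
    prob = {"G": gc / 2, "C": gc / 2, "A": (1 - gc) / 2, "T": (1 - gc) / 2}

    profile = {}
    for kmer in map("".join, product("ACGT", repeat=k)):
        expected = 1.0
        for base in kmer:
            expected *= prob[base]
        observed = counts[kmer] / total
        # A 4-mer that is impossible under the model gets a ratio of 0.
        profile[kmer] = observed / expected if expected > 0 else 0.0
    return profile


if __name__ == "__main__":
    example = gc_normalized_4mers("ACGT" * 10)
    print(len(example), example["ACGT"])
```

A vector of 256 such ratios per sample could then serve as the feature input to a classifier or ordination method, which is the role the abstract assigns to SIG-mer profiles.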