Machine Learning Models for Preterm Birth Prediction Using Vaginal Microbiome Profiles in a Mexican Cohort
Abstract
Background: Preterm birth (PTB, <37 weeks of gestation) affects approximately 10% of pregnancies in Mexico and remains a leading cause of neonatal morbidity and mortality worldwide. The vaginal microbiome has emerged as a potential biomarker of PTB risk, with dysbiotic states characterized by reduced Lactobacillus dominance and increased microbial diversity implicated in inflammatory pathways leading to premature parturition. However, Hispanic/Latino populations remain severely underrepresented in microbiome-based PTB prediction research, limiting clinical translation of existing models.

Methods: We developed and evaluated machine learning models for PTB prediction using vaginal microbiome data from 43 pregnant Mexican women (110 longitudinal samples, 14 preterm births <37 weeks). Genus-level relative abundances were processed using centered log-ratio transformation within a nested cross-validation framework with subject-level splitting to prevent data leakage. We systematically compared Random Forest and Elastic Net algorithms across three clinical feature selection strategies: (1) minimal DREAM-style adjustment (gestational age + maternal age); (2) literature-based comprehensive features (10 evidence-based PTB risk factors); and (3) data-driven empirical selection (top 10 variables selected independently within each cross-validation fold using univariate screening with p<0.20). Microbiome features included either ANCOM-BC2 differentially abundant taxa (selected independently within each fold) or full filtered profiles. Critically, all feature selection procedures were executed within cross-validation folds using only training data, so that performance estimates are not biased by information from the test folds.

Results: Random Forest with data-driven feature selection and the full microbiome profile achieved the best discrimination (AUROC 0.849 ± 0.130; PRAUC 0.571 ± 0.208), with sensitivity 80.0% and specificity 47.3% at the optimized threshold. This performance exceeds the DREAM Challenge benchmark for late PTB (AUROC 0.69) despite a substantially smaller sample size. Feature importance analysis identified anthropometric variables (BMI, pre-pregnancy weight) and key microbial genera (Methylobacterium, Lactobacillus, Anaerococcus) as primary drivers. ANCOM-BC2 analysis across cross-validation folds revealed consistent enrichment of Peptostreptococcus (selected in 100% of folds) and Mycoplasma (80% of folds) in preterm births, taxa mechanistically linked to Toll-like receptor activation, pro-inflammatory cytokine production, and matrix metalloproteinase-driven cervical remodeling.

Conclusions: A machine learning-based PTB prediction model developed specifically for a Mexican cohort demonstrates the feasibility of microbiome-based risk stratification in Latin American populations. The nested cross-validation with fold-specific feature selection prevented the data leakage that has inflated performance in previous studies. However, the limited sample size (wide confidence intervals, SD 0.13–0.25) points to the need for further studies, in particular external validation in larger, independent cohorts, before broad clinical implementation. This work addresses a critical equity gap and establishes a methodological framework for population-specific precision medicine in pregnancy complications.
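
The Methods rest on three mechanics that are easy to get wrong: the centered log-ratio (CLR) transform, subject-level splitting of longitudinal samples, and feature selection that sees only training data. The sketch below illustrates all three in plain Python. It is not the authors' pipeline. The pseudocount, the function names, and the mean-gap ranking (a simple stand-in for the univariate p<0.20 screening and for ANCOM-BC2) are assumptions made for illustration only.

```python
import math
import random
from collections import defaultdict


def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio transform of one sample's relative abundances.

    The pseudocount is an assumption of this sketch (the abstract does not
    say how zeros were handled); it keeps log() defined for absent genera.
    """
    logs = [math.log(a + pseudocount) for a in abundances]
    mean_log = sum(logs) / len(logs)  # log of the geometric mean
    return [v - mean_log for v in logs]


def subject_level_folds(subject_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs in which no subject spans both sets."""
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    for k in range(n_folds):
        # Every n_folds-th subject goes to the test set of fold k.
        test_subjects = set(subjects[k::n_folds])
        test_idx = [i for s in subjects if s in test_subjects
                    for i in by_subject[s]]
        train_idx = [i for s in subjects if s not in test_subjects
                     for i in by_subject[s]]
        yield sorted(train_idx), sorted(test_idx)


def top_features_by_mean_gap(rows, labels, train_idx, top_k):
    """Rank features by |mean(preterm) - mean(term)| using training rows only.

    Hypothetical stand-in for the study's univariate screening; the point is
    that test rows never influence which features are kept.
    """
    pos = [rows[i] for i in train_idx if labels[i] == 1]
    neg = [rows[i] for i in train_idx if labels[i] == 0]
    if not pos or not neg:
        raise ValueError("training fold needs samples from both classes")
    n_features = len(rows[0])
    gaps = []
    for j in range(n_features):
        mean_pos = sum(r[j] for r in pos) / len(pos)
        mean_neg = sum(r[j] for r in neg) / len(neg)
        gaps.append(abs(mean_pos - mean_neg))
    ranked = sorted(range(n_features), key=lambda j: gaps[j], reverse=True)
    return ranked[:top_k]


if __name__ == "__main__":
    # Toy data: 6 samples from 4 hypothetical subjects, 3 genera each.
    subject_ids = ["s1", "s1", "s2", "s3", "s3", "s4"]
    labels = [1, 1, 0, 0, 0, 1]
    raw = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1], [0.1, 0.1, 0.8],
           [0.2, 0.1, 0.7], [0.1, 0.2, 0.7], [0.8, 0.1, 0.1]]
    rows = [clr(sample) for sample in raw]
    for train_idx, test_idx in subject_level_folds(subject_ids, n_folds=2):
        kept = top_features_by_mean_gap(rows, labels, train_idx, top_k=2)
        print("train:", train_idx, "test:", test_idx, "kept features:", kept)
```

Because the split is made over subjects rather than samples, all longitudinal samples from one woman land on the same side of each fold. In a real analysis one would more likely use an established implementation, for example scikit-learn's `GroupKFold` with the selection step wrapped inside a `Pipeline`, and nest a second, inner loop of the same kind for hyperparameter tuning.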