Phylogeny-Informed Random Forests for Human Microbiome Studies
Abstract
Random Forest is a widely used tree-based ensemble learning algorithm that efficiently captures complex nonlinear relationships and higher-order feature interactions without requiring any distributional assumptions. It is also well suited to human microbiome studies, where the data are highly skewed, overdispersed, discrete, and irregular. Here, I pay particular attention to phylogenetic tree information, which reflects evolutionary ancestry and functional relatedness among microbial features. Proper incorporation of phylogenetic tree information into microbiome data analysis has provided new insights and improved analytical performance. In this paper, I introduce an extension of the Random Forest algorithm, named Phylogeny-Informed Random Forests (PIRF), that incorporates phylogenetic tree information to improve predictive accuracy in human microbiome studies. The core mechanism of PIRF is its localized approach: rather than treating all features as competing globally to be selected or weighted, PIRF identifies informative features within each phylogenetic cluster (i.e., a localized group of microbial features that are evolutionarily and functionally related), thereby enriching functional representations while reducing tree correlation. I demonstrate the high predictive accuracy of PIRF relative to other off-the-shelf tools across seven benchmark tasks: four classification problems (gingival inflammation, immunotherapy response, type 1 diabetes, and obesity) and three regression problems (cytokine level, age based on oral microbiome, and age based on gut microbiome).
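As a rough illustration of the localized idea described above, the following Python sketch contrasts global feature selection with selection within each cluster. It is not the PIRF implementation. The scoring rule (absolute Pearson correlation with the outcome), the cluster labels, the toy taxon names, and the parameter `k` are all assumptions made for illustration only.

```python
from collections import defaultdict
from math import sqrt


def abs_pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / sqrt(sxx * syy)


def select_within_clusters(features, clusters, y, k=1):
    """Keep the k highest-scoring features inside each cluster.

    features: dict mapping feature name -> list of abundances (one per sample)
    clusters: dict mapping feature name -> cluster label (hypothetical labels,
              e.g. obtained by cutting a phylogenetic tree)
    """
    by_cluster = defaultdict(list)
    for name, values in features.items():
        by_cluster[clusters[name]].append((abs_pearson(values, y), name))
    selected = []
    for label in sorted(by_cluster):
        # Rank by descending score; break ties by feature name.
        ranked = sorted(by_cluster[label], key=lambda t: (-t[0], t[1]))
        selected.extend(name for _, name in ranked[:k])
    return selected


# Toy data (made up): six samples, binary outcome, four taxa in two clusters.
y = [0, 0, 0, 1, 1, 1]
features = {
    "taxon_a1": [1, 2, 1, 8, 9, 7],  # cluster A, strong signal
    "taxon_a2": [2, 1, 2, 7, 8, 9],  # cluster A, strong signal
    "taxon_b1": [5, 3, 4, 5, 6, 6],  # cluster B, weaker signal
    "taxon_b2": [3, 4, 5, 4, 3, 5],  # cluster B, no linear signal
}
clusters = {"taxon_a1": "A", "taxon_a2": "A", "taxon_b1": "B", "taxon_b2": "B"}

# Global competition: the two best features overall.
scores = {name: abs_pearson(values, y) for name, values in features.items()}
global_top2 = sorted(scores, key=lambda n: (-scores[n], n))[:2]

# Localized competition: the best feature within each cluster.
selected = select_within_clusters(features, clusters, y, k=1)
```

On this toy data, the global ranking keeps both cluster A taxa, whereas the within-cluster selection keeps one taxon from each cluster, so the weaker but distinct signal in cluster B is still represented.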
Importance
PIRF is an extension of the Random Forest algorithm that incorporates phylogenetic tree information to improve predictive accuracy in human microbiome studies. PIRF can serve as a useful tool for microbiome-based disease diagnostics and personalized medicine. The software and tutorials are freely available as an R package, named PIRF, at https://github.com/hk1785/PIRF.