Tolerance Regions for Compositional Data with Application to Reference Regions for Healthy Microbiome Profiles
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Imbalances in the human microbiome are associated with numerous diseases, highlighting the need for benchmarks that define healthy microbiome composition and identify abnormal deviations. Although the microbiome is increasingly studied as a potential clinical marker, statistical approaches for constructing reference regions of healthy microbiome composition remain relatively underexplored. This work develops statistical methods to construct reference regions for healthy microbiome data, addressing three main challenges. First, since microbiome data contain relative rather than absolute information, standard statistical methods are not directly appropriate. Therefore, microbiome profiles are treated as compositional data satisfying a sum constraint, and log-ratio transformations are used to analyze them in real space while preserving their relative structure. Second, reference regions are constructed as tolerance regions rather than confidence regions, so that they cover a pre-specified proportion of the healthy population with a given confidence level. The proposed framework incorporates both parametric and nonparametric approaches for constructing these tolerance regions. Parametric methods are considered when the ilr-transformed data approximately follow an elliptical distribution, where they can yield smaller regions while maintaining the desired coverage. Nonparametric approaches provide a flexible alternative by avoiding distributional assumptions. Third, because microbiome data are multidimensional and difficult to interpret, quantitative and graphical tools are introduced to assess atypicality and identify which microbial taxa contribute most to deviations from healthy profiles. Simulation studies are conducted to evaluate the performance of the proposed methods. The methodology is then demonstrated by constructing reference regions for healthy microbiome profiles using real-world data. Finally, the approach is applied to microbiome datasets comparing healthy and patient profiles to assess whether patient samples are identified as atypical and to examine which taxa contribute to these deviations. Overall, the proposed framework provides a clear and statistically robust approach for defining healthy microbiome reference regions and detecting atypical microbiome profiles.