Geometric averaging provides normalization-invariant feature ranking in compositional sequencing data
Abstract
In compositional next-generation sequencing (NGS) analyses (including microbiome studies, RNA-seq and metagenomics) the arithmetic mean (AM) of relative proportions is the default operator for summarizing feature abundances. We show that this default produces unstable rankings in real compositional data. Across 102 prevalent genera in the dietswap dataset (n = 38 baseline samples), 23 genera (22.5%), including members of Bacteroides, Eubacterium and Bilophila, yielded opposite group-level conclusions under AM and the geometric mean (GM).
This pattern reflects two formal properties of compositional aggregation. First, AM-based rankings change with the within-sample normalization domain, whereas GM-based rankings are invariant under the multiplicative structure of compositional data. Second, the centered log-ratio (CLR) transformation absorbs geometric averaging into the data representation, so that arithmetic averaging in CLR space recovers the GM ranking exactly. Both properties were verified numerically on the dietswap dataset, where the Spearman correlation between GM- and CLR-based rankings was 1.000 in both groups.
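The two properties can be illustrated on a toy example. The sketch below is not the authors' code and does not use the dietswap data: the counts, feature names and helper functions are hypothetical, and the change of "normalization domain" is assumed to mean dropping one feature (C) and renormalizing each sample over the rest. All values are strictly positive, so no zero handling is needed here; real count tables would need a zero-replacement step first.

```python
import math

FEATURES = ["A", "B", "C", "D"]
# Hypothetical counts: two samples (rows) by four features (columns).
COUNTS = [
    [30, 10, 10, 50],
    [2, 8, 80, 10],
]

def close(sample):
    """Rescale one sample so that its parts sum to 1."""
    total = sum(sample)
    return [v / total for v in sample]

def clr(sample):
    """Centered log-ratio: log of each part minus the sample's mean log."""
    logs = [math.log(v) for v in sample]
    center = sum(logs) / len(logs)
    return [x - center for x in logs]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # Requires strictly positive values; zeros need separate handling.
    return math.exp(sum(math.log(v) for v in values) / len(values))

def ranking(samples, names, operator):
    """Feature names ordered from highest to lowest summary value."""
    scores = {n: operator([s[j] for s in samples]) for j, n in enumerate(names)}
    return sorted(names, key=scores.get, reverse=True)

# Normalization 1: proportions of the full sample total.
full = [close(s) for s in COUNTS]
# Normalization 2: drop feature C, then re-close over the remaining features.
shared = ["A", "B", "D"]
keep = [FEATURES.index(n) for n in shared]
sub = [close([s[j] for j in keep]) for s in COUNTS]

am_full = [n for n in ranking(full, FEATURES, arithmetic_mean) if n in shared]
am_sub = ranking(sub, shared, arithmetic_mean)
gm_full = [n for n in ranking(full, FEATURES, geometric_mean) if n in shared]
gm_sub = ranking(sub, shared, geometric_mean)

# Arithmetic mean of CLR values versus geometric mean of proportions.
clr_rank = ranking([clr(s) for s in full], FEATURES, arithmetic_mean)
gm_rank = ranking(full, FEATURES, geometric_mean)

print("AM:", am_full, "->", am_sub)  # ['D', 'A', 'B'] -> ['D', 'B', 'A']
print("GM:", gm_full, "->", gm_sub)  # ['D', 'B', 'A'] -> ['D', 'B', 'A']
print("mean CLR ranking equals GM ranking:", clr_rank == gm_rank)  # True
```

In this toy case the AM ordering of A and B swaps when C is removed, while the GM ordering does not. The CLR result follows from the definition: the mean CLR value of a feature equals the log of its GM minus a term that is the same for every feature, so the two orderings must coincide.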
The operator-choice problem propagates to between-group differential inference: under AM, log2 fold-changes vary across normalizations and the relative ranking of features by effect size is not preserved; under GM and CLR, the ranking is preserved. We recommend GM-based summaries for feature ranking and CLR-transformed abundances for cross-sample comparisons. This change requires no new computational tools and is fully compatible with existing differential-abundance pipelines, yet it eliminates an under-recognized source of irreproducibility in biomarker discovery across microbiome studies, transcriptomics, metagenomics, and mass-spectrometry-based metabolomics, that is, in all settings where features are quantified relative to a sample total.
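The fold-change claim can be checked the same way. The following is a minimal sketch under stated assumptions, not the authors' pipeline: the two groups, their counts and the helper names are hypothetical, and the two normalizations compared are the full sample total versus the total after dropping feature C.

```python
import math

FEATURES = ["A", "B", "C", "D"]
# Hypothetical counts: two samples per group, four features per sample.
GROUP_1 = [[30, 10, 10, 50], [2, 8, 80, 10]]
GROUP_2 = [[20, 20, 20, 40], [10, 5, 60, 25]]

def close(sample):
    """Rescale one sample so that its parts sum to 1."""
    total = sum(sample)
    return [v / total for v in sample]

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # Requires strictly positive values; zeros need separate handling.
    return math.exp(sum(math.log(v) for v in values) / len(values))

def fold_changes(keep, operator):
    """log2(group 1 / group 2) per feature, normalizing over the kept columns."""
    a = [close([s[j] for j in keep]) for s in GROUP_1]
    b = [close([s[j] for j in keep]) for s in GROUP_2]
    result = {}
    for pos, j in enumerate(keep):
        numerator = operator([s[pos] for s in a])
        denominator = operator([s[pos] for s in b])
        result[FEATURES[j]] = math.log2(numerator / denominator)
    return result

def order_by_effect(lfc, names):
    """Names ordered from largest to smallest signed log2 fold-change."""
    return sorted(names, key=lfc.get, reverse=True)

shared = ["A", "B", "D"]
all_cols = [0, 1, 2, 3]   # normalize by the full sample total
without_c = [0, 1, 3]     # normalize after dropping feature C

am_all = fold_changes(all_cols, arithmetic_mean)
am_sub = fold_changes(without_c, arithmetic_mean)
gm_all = fold_changes(all_cols, geometric_mean)
gm_sub = fold_changes(without_c, geometric_mean)

print("AM:", order_by_effect(am_all, shared), "->", order_by_effect(am_sub, shared))
# AM: ['A', 'D', 'B'] -> ['B', 'D', 'A']
print("GM:", order_by_effect(gm_all, shared), "->", order_by_effect(gm_sub, shared))
# GM: ['B', 'D', 'A'] -> ['B', 'D', 'A']

# Under GM every feature's log2 fold-change shifts by the same offset.
offsets = [gm_sub[n] - gm_all[n] for n in shared]
print("GM offset spread:", max(offsets) - min(offsets))  # numerically zero
```

In this toy case the AM ordering reverses between the two normalizations, whereas the GM fold-changes all shift by one common offset, which leaves their ordering intact. Because of that offset, the sign of an individual GM fold-change can still depend on the normalization; it is the ranking by effect size that is preserved.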
IMPORTANCE
Studies of the gut microbiome routinely identify which bacterial groups are more or less abundant in patients versus healthy controls, in different diets, or before and after a treatment. The same kind of comparison underlies sequencing-based analyses across biology, from gene expression to metagenomics. To do this, researchers must average the abundance of each measured entity across many samples, and the standard choice is the simple arithmetic average. We show that this choice can be misleading for any data where each measurement is expressed relative to a sample total, as is typical of sequencing-based assays, and that in real data it can flip the answer to which group is more enriched. Analyzing a published dietary intervention study, we found that more than one in five gut bacterial genera (including Bacteroides and Eubacterium) gave opposite results depending on which average was used. Switching to the geometric average resolves this inconsistency and makes biomarker discovery more reproducible. This change is immediate to implement (it does not require new software or specialized training) and applies not only to microbiome studies but to any biological measurement where what is detected, whether a gene transcript, a microbial taxon, or a metabolite, is quantified relative to a sample total: gene-expression analysis, metagenomics, and metabolomics, among others.