Geometric averaging provides normalization-invariant feature ranking in compositional sequencing data

Emilia Nunzi
Luigina Romani

Abstract

In compositional next-generation sequencing (NGS) analyses (including microbiome studies, RNA-seq and metagenomics) the arithmetic mean (AM) of relative proportions is the default operator for summarizing feature abundances. We show that this default produces unstable rankings in real compositional data. Across 102 prevalent genera in the dietswap dataset (n = 38 baseline samples), 23 genera (22.5%), including members of Bacteroides, Eubacterium and Bilophila, yielded opposite group-level conclusions under AM and the geometric mean (GM).
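A minimal sketch of how AM and GM can disagree about which group is enriched. The proportions are invented for illustration and are not taken from the dietswap dataset. The helper names are hypothetical. The GM is only defined for strictly positive values, so real data containing zeros would need some zero-replacement step that the abstract does not specify.

```python
import numpy as np

def arithmetic_mean(x):
    """Arithmetic mean of a 1-D array of proportions."""
    return float(np.mean(x))

def geometric_mean(x):
    """Geometric mean via the mean of logs; requires strictly positive values."""
    return float(np.exp(np.mean(np.log(x))))

# Hypothetical relative abundances of one genus in two groups (illustrative only).
# Group A has one sample with a very high proportion; group B is uniform.
group_a = np.array([0.01, 0.01, 0.40])
group_b = np.array([0.10, 0.10, 0.10])

am_says_a_higher = arithmetic_mean(group_a) > arithmetic_mean(group_b)
gm_says_a_higher = geometric_mean(group_a) > geometric_mean(group_b)

print("AM: A > B ?", am_says_a_higher)  # True  (0.14 vs 0.10)
print("GM: A > B ?", gm_says_a_higher)  # False (about 0.034 vs 0.10)
```

In this toy case a single high-abundance sample dominates the arithmetic mean, while the geometric mean weights all samples equally on the log scale, so the two operators point in opposite directions.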

This pattern reflects two formal properties of compositional aggregation. First, AM-based rankings change with the within-sample normalization domain, whereas GM-based rankings are invariant under the multiplicative structure of compositional data. Second, the centered log-ratio (CLR) transformation absorbs geometric averaging into the data representation, so that arithmetic averaging in CLR space recovers the GM ranking exactly. Both properties were verified numerically on the dietswap dataset, where the Spearman correlation between GM- and CLR-based rankings was 1.000 in both groups.
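Both properties can be checked on a small example. The sketch below assumes that "changing the normalization domain" means re-closing each sample over a subset of features, which multiplies every retained proportion in a sample by the same sample-specific factor. The count matrix and helper names are hypothetical and were chosen by hand so that the AM ranking flips.

```python
import numpy as np

# Hypothetical counts: rows are samples, columns are features f0..f3 (illustrative only).
counts = np.array([
    [30.0, 10.0, 5.0, 955.0],
    [120.0, 80.0, 50.0, 750.0],
    [20.0, 60.0, 10.0, 10.0],
])

def close(x):
    """Normalize each sample (row) to proportions summing to 1."""
    return x / x.sum(axis=1, keepdims=True)

def am(p):
    """Per-feature arithmetic mean across samples."""
    return p.mean(axis=0)

def gm(p):
    """Per-feature geometric mean across samples (strictly positive input)."""
    return np.exp(np.log(p).mean(axis=0))

def clr(p):
    """Centered log-ratio: log proportion minus the sample's mean log proportion."""
    logp = np.log(p)
    return logp - logp.mean(axis=1, keepdims=True)

def order(v):
    """Feature indices sorted from highest to lowest value."""
    return np.argsort(-v).tolist()

full = close(counts)         # normalized over all four features
sub = close(counts[:, :3])   # normalized over f0..f2 only

# Rankings of the three shared features under each operator and domain.
am_full, am_sub = order(am(full)[:3]), order(am(sub))
gm_full, gm_sub = order(gm(full)[:3]), order(gm(sub))
clr_full = order(clr(full).mean(axis=0)[:3])
clr_sub = order(clr(sub).mean(axis=0))

print("AM :", am_full, am_sub)    # [1, 0, 2] vs [0, 1, 2]: ranking changes
print("GM :", gm_full, gm_sub)    # [0, 1, 2] vs [0, 1, 2]: ranking preserved
print("CLR:", clr_full, clr_sub)  # matches the GM ranking in both domains
```

The reason is that a per-sample factor c_j enters the GM of every feature as the same multiplier (the geometric mean of the c_j), whereas in the AM it reweights samples differently for each feature. The mean CLR value of a feature equals its log GM minus a constant shared by all features, so the two rankings coincide.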

The operator-choice problem propagates to between-group differential inference: under AM, log₂ fold-changes vary across normalizations and the relative ranking of features by effect size is not preserved; under GM and CLR, the ranking is preserved. We recommend GM-based summaries for feature ranking and CLR-transformed abundances for cross-sample comparisons. This change requires no new computational tools and is fully compatible with existing differential-abundance pipelines, yet it eliminates an under-recognized source of irreproducibility in biomarker discovery. It applies to microbiome studies, transcriptomics, metagenomics, and mass-spectrometry-based metabolomics, and more generally to any setting where features are quantified relative to a sample total.
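The fold-change behavior can be sketched on simulated data. Everything here is an assumption made for illustration: the lognormal counts, the group sizes, the choice to change normalization by dropping the last feature, and the helper names. Under GM, changing the normalization shifts every feature's log₂ fold-change by the same constant, so the ordering of features by signed fold-change is unchanged; under AM the shift differs from feature to feature.

```python
import numpy as np

rng = np.random.default_rng(0)
n_per_group, n_features = 8, 6

# Simulated positive "counts" for two groups (illustrative only, not real data).
counts_a = rng.lognormal(mean=3.0, sigma=1.0, size=(n_per_group, n_features))
counts_b = rng.lognormal(mean=3.0, sigma=1.0, size=(n_per_group, n_features))

def close(x):
    """Normalize each sample (row) to proportions summing to 1."""
    return x / x.sum(axis=1, keepdims=True)

def log2fc_am(pa, pb):
    """log2 ratio of per-feature arithmetic means (group A over group B)."""
    return np.log2(pa.mean(axis=0) / pb.mean(axis=0))

def log2fc_gm(pa, pb):
    """log2 ratio of per-feature geometric means (group A over group B)."""
    return np.log2(pa).mean(axis=0) - np.log2(pb).mean(axis=0)

keep = slice(0, 5)  # second normalization domain: first five features only

full_a, full_b = close(counts_a), close(counts_b)
sub_a, sub_b = close(counts_a[:, keep]), close(counts_b[:, keep])

# Per-feature change in log2 fold-change when the normalization domain changes.
shift_am = log2fc_am(sub_a, sub_b) - log2fc_am(full_a, full_b)[keep]
shift_gm = log2fc_gm(sub_a, sub_b) - log2fc_gm(full_a, full_b)[keep]

spread_am = float(shift_am.max() - shift_am.min())
spread_gm = float(shift_gm.max() - shift_gm.min())

print("AM shift spread across features:", spread_am)  # non-zero
print("GM shift spread across features:", spread_gm)  # zero up to rounding
```

A constant shift leaves the order of features by signed fold-change intact, although the absolute values themselves still depend on the normalization. Whether the AM ordering actually changes in a given dataset depends on the data; the sketch only shows that nothing guarantees it stays fixed.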

IMPORTANCE

Studies of the gut microbiome routinely identify which bacterial groups are more or less abundant in patients versus healthy controls, in different diets, or before and after a treatment. The same kind of comparison underlies sequencing-based analyses across biology, from gene expression to metagenomics. To do this, researchers must average the abundance of each measured entity across many samples, and the standard choice is the simple arithmetic average. We show that this choice can be misleading for any data where each measurement is expressed relative to a sample total, as is typical of sequencing-based assays, and that in real data it can flip the answer to which group is more enriched. Analyzing a published dietary intervention study, we found that one in five gut bacteria (including Bacteroides and Eubacterium) gave opposite results depending on which average was used. Switching to the geometric average resolves this inconsistency and makes biomarker discovery more reproducible. The change is straightforward to implement, requiring neither new software nor specialized training. It applies not only to microbiome studies but to any biological measurement in which the detected entity, whether a gene transcript, a microbial taxon, or a metabolite, is quantified relative to a sample total, including gene-expression analysis, metagenomics, and metabolomics.

Version published to 10.64898/2026.05.16.725171 on bioRxiv
May 19, 2026

