WitChi: Efficient Detection and Pruning of Compositional Bias in Phylogenomic Alignments Using Empirical Chi-Squared Testing
Abstract
Convergent evolution, where unrelated taxa independently evolve similar nucleotide or amino acid compositions, can introduce compositional bias into biological sequence data. Such biases distort phylogenetic inference, particularly in deep or unevenly sampled phylogenomic datasets. Although composition-aware models can mitigate this issue, their computational demands often preclude their use in large-scale analyses. We present WitChi, a computationally efficient tool for identifying and removing compositionally biased alignment columns using empirical significance testing. WitChi calculates taxon-specific chi-squared (χ²) scores and compares them to null distributions derived from column permutations that preserve the phylogenetic structure of the alignment. The sites most responsible for significant deviation are iteratively pruned using one of three scoring algorithms. Pruning continues until no taxa remain significantly biased or until further pruning would reduce the overall alignment χ² score below the expected null range. Z-scores and p-values are reported for both individual taxa and whole alignments, offering interpretable metrics of bias severity. Pruning of simulated compositionally heterogeneous alignments shows that WitChi reliably restores correct topologies under standard, compositionally stationary models. In benchmarks, WitChi matches or outperforms BMGE’s stationary-based trimming while scaling linearly with the number of taxa. Applied to the archaeal GTDB r220 dataset (5,869 taxa; 10,101 sites), WitChi completes pruning in under two hours on four CPU cores. The resulting phylogeny recovers key clades previously resolved only by in-depth analyses using complex models of sequence evolution. WitChi provides an efficient and scalable solution for detecting and removing compositional bias, enabling more accurate phylogenetic inference across the tree of life.
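The testing step described in the abstract can be illustrated with a minimal sketch. It is not WitChi's implementation: the abstract does not specify the permutation scheme or the scoring algorithms, so the sketch assumes (1) a gap-free alignment of equal-length sequences, (2) a per-taxon χ² computed against pooled state frequencies, and (3) a null distribution built by shuffling states among taxa within each column, which keeps every column's composition intact. The function names `taxon_chi2`, `shuffle_within_columns`, and `empirical_z_and_p` are hypothetical.

```python
import random
from collections import Counter
from statistics import mean, pstdev


def taxon_chi2(alignment):
    """Chi-squared score of each sequence against pooled state frequencies."""
    pooled = Counter("".join(alignment))
    total = sum(pooled.values())
    scores = []
    for seq in alignment:
        observed = Counter(seq)
        score = 0.0
        for state, count in pooled.items():
            expected = len(seq) * count / total
            score += (observed.get(state, 0) - expected) ** 2 / expected
        scores.append(score)
    return scores


def shuffle_within_columns(alignment, rng):
    """Assumed null model: permute states among taxa inside every column."""
    columns = [list(col) for col in zip(*alignment)]
    for col in columns:
        rng.shuffle(col)
    return ["".join(row) for row in zip(*columns)]


def empirical_z_and_p(alignment, n_perm=200, seed=0):
    """Return (chi2, z-score, empirical p-value) for each taxon."""
    rng = random.Random(seed)
    observed = taxon_chi2(alignment)
    # Pool the scores of all taxa across all permuted alignments.
    null = []
    for _ in range(n_perm):
        null.extend(taxon_chi2(shuffle_within_columns(alignment, rng)))
    mu, sigma = mean(null), pstdev(null)
    results = []
    for score in observed:
        z = (score - mu) / sigma if sigma > 0 else 0.0
        # Add-one correction keeps the empirical p-value above zero.
        p = (1 + sum(s >= score for s in null)) / (1 + len(null))
        results.append((score, z, p))
    return results
```

Calling `empirical_z_and_p(["ACGTACGTACGT", "ACGTACGAACGT", "ACCTACGTACGT", "GGGGGCGGGGGG"])` returns one `(chi2, z, p)` tuple per taxon, with the G-rich fourth sequence receiving the largest χ² score. A pruning loop in the spirit of the abstract would then remove the columns contributing most to significant scores and repeat the test, stopping once no taxon remains significant.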