Efficient Identification of Phylogenetically Informative Alignment Sites via Sparse Learning

Carlos G. Schrago

Abstract

Identifying phylogenetically informative sites in multiple sequence alignments is critical for accurate tree reconstruction and efficient data curation in phylogenomics. Existing approaches that measure phylogenetic information often rely on predefined topologies or heuristic criteria, limiting their generality and interpretability. Here, we introduce a novel, topology-agnostic framework for quantifying site-wise phylogenetic information using sparse learning via Lasso (Least Absolute Shrinkage and Selection Operator) regression. By modeling site log-likelihoods as predictors of the tree likelihood across a large ensemble of random topologies, our approach isolates the minimal subset of sites that meaningfully contribute to phylogenetic signal. We validate the method using both simulated and empirical mammalian datasets, demonstrating that Lasso-selected sites yield topologies nearly identical to those inferred from full alignments. For computational efficiency, we show that a simple entropy-based proxy (Shannon H ≥ 0.5) approximates Lasso results with high fidelity, enabling rapid site-level assessments. Importantly, our definition of phylogenetically informative sites provides an objective metric that can serve as a gold standard to evaluate commonly used alignment filtering tools. These findings establish sparse learning as a principled, scalable, and practical approach for assessing and optimizing phylogenetic data.
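As a rough illustration of the entropy-based proxy mentioned in the abstract, the sketch below computes per-column Shannon entropy for a toy alignment and keeps columns with H ≥ 0.5. The abstract does not state the logarithm base or how gaps and ambiguous characters are treated, so base 2 (bits), DNA input, and skipping of `-`, `?`, and `N` are assumptions made here. The function names `column_entropy` and `select_sites` are hypothetical, and this is not the author's implementation.

```python
import math
from collections import Counter


def column_entropy(column, ignore="-?N"):
    """Shannon entropy in bits of one alignment column.

    Assumption: DNA data; gap and ambiguity symbols in `ignore` are skipped.
    """
    counts = Counter(ch for ch in column.upper() if ch not in ignore)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # column holds only gaps/ambiguity codes
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def select_sites(alignment, threshold=0.5):
    """Return 0-based indices of columns whose entropy is >= threshold."""
    columns = zip(*alignment)  # transpose rows (sequences) into columns (sites)
    return [
        i for i, col in enumerate(columns)
        if column_entropy("".join(col)) >= threshold
    ]


# Toy alignment: four sequences, five sites (made up for illustration).
toy_alignment = [
    "ACGTA",
    "ACGTC",
    "ACTTG",
    "AGTTT",
]

informative = select_sites(toy_alignment)
print(informative)  # [1, 2, 4]
```

In this toy case the invariant columns 0 and 3 have zero entropy and are dropped, while the three variable columns pass the threshold. Whether such a filter approximates the Lasso selection on real data is the claim the preprint evaluates; the sketch only shows the mechanics of the threshold.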

Version published to 10.1101/2025.07.24.666198 on bioRxiv
Jul 27, 2025

Related articles

GTcomplex: Spatial indexing-powered search and alignment of macromolecular complexes
Mindaugas Margelevicius. Latest version Jan 22, 2026.

Testing the validity and adequacy of linguistic phylogenetic analyses
Benedict King. Latest version Dec 17, 2025.

Rapid Phylogenomic Analysis of Thousands Outbreak‐Causing Viral Genomes Using Covary
Marvin I. De los Santos. Latest version Dec 22, 2025.
