Covary: A translation-aware framework for alignment-free phylogenetics using machine learning
Abstract
In large-scale phylogenetic analysis, incorporating translation awareness is critical to account for the genotypic and phenotypic dimensions underlying biological diversification. Covary is a machine learning-based framework that analyzes, clusters, and compares genetic sequences through alignment-free, translation-aware embeddings. By integrating codon-boundary and intra-sequence positional information into a unified vector representation, Covary encodes mutational patterns alongside translation-level constraints. This design enables discrimination of frameshift-inducing mutations, substitutions, and other biologically meaningful sequence variations relevant to evolutionary relationships. Despite inherent sensitivity to k-mer-based distortions, Covary accurately clustered sequences, identified species, and reconstructed phylogenetic trees across diverse datasets, including human TP53 variants, ribosomal gene markers (18S and 16S), and complete genomes from viral, bacterial, and archaeal taxa. The resulting topologies were comparable to those produced by multiple sequence alignment (MSA)-based implementations like ETE3, with near-linear scalability demonstrated by the successful analysis of nearly a thousand SARS-CoV-2 genomes within minutes. The versatility and interpretability of Covary across mutation-, gene-, and genome-level analyses underscore its potential as a biologically informed, data-driven tool for bioinformatics, comparative genomics, taxonomy, ecology, and evolutionary studies. Covary is available online at https://github.com/mahvin92/Covary or at https://covary.chordexbio.com.
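The general idea of a translation-aware, alignment-free comparison can be illustrated with a minimal sketch. This is not Covary's implementation: the function names `frame_tagged_kmers` and `cosine_distance`, the choice of k = 3, the cosine metric, and the toy sequence are all assumptions made for illustration. Each k-mer is tagged with its codon phase (start index modulo 3), so that a deletion that shifts the reading frame changes the tags of every downstream k-mer, while a substitution only alters the few k-mers that overlap it.

```python
from collections import Counter
from math import sqrt


def frame_tagged_kmers(seq, k=3):
    """Count k-mers tagged with their codon phase (start index mod 3).

    Hypothetical helper for illustration; not part of Covary.
    """
    seq = seq.upper()
    return Counter((seq[i:i + k], i % 3) for i in range(len(seq) - k + 1))


def cosine_distance(a, b):
    """Cosine distance between two sparse count vectors (Counters)."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


# Toy coding sequence (made up for this example).
reference = "ATGGCCAAGCTGACCGATGAATTCGGCTAA"
substitution = reference[:10] + "A" + reference[11:]  # one base replaced
frameshift = reference[:10] + reference[11:]          # one base deleted

ref_vec = frame_tagged_kmers(reference)
d_sub = cosine_distance(ref_vec, frame_tagged_kmers(substitution))
d_fs = cosine_distance(ref_vec, frame_tagged_kmers(frameshift))

print(f"substitution distance: {d_sub:.3f}")
print(f"frameshift distance:   {d_fs:.3f}")
```

Under these assumptions the frameshift variant lands much farther from the reference than the substitution variant, because the phase tags of nearly all k-mers downstream of the deletion no longer match. A distance matrix built this way could then be passed to any standard clustering or tree-building routine.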