PhyloAug: An Evolutionary and Structure-Aware Data Augmentation Tool
Abstract
Genomic Language Models (GLMs) suffer from an inherent scarcity of training data, owing to the cost, time and complexity of wet-lab experiments. Data augmentation offers a solution; however, traditional methods may unintentionally alter the underlying structure or function of a sequence. By combining evolutionary signals with RNA secondary structure, augmentations can retain the original function and remain structurally coherent. To implement this, we developed PhyloAug, a structure-aware, evolution-inspired augmentation method grounded in neutral theory. We employ Genomic Foundation Models (GFMs) to accurately perturb RNA sequences, and utilise phylogenetic analysis via PAML to provide site-wise restrictions based on evolutionary principles. These restrictions are obtained by identifying evolutionarily neutral sites (sequence positions where mutations are unlikely to alter function), which are concatenated with the predicted (or known) secondary structure. We thereby ensure adherence to the underlying structure whilst enabling biologically plausible variation. To evaluate the biological validity of our augmentations, we compare our predicted neutral sites with Rfam-annotated conserved regions and assess sequence similarity to the underlying multiple sequence alignments. We then fine-tune GLMs on the augmented data, yielding significant performance improvements of up to 12.9% in MCC and 17.2% in F1-score.
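As an illustration only, the site-wise restriction described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the PhyloAug implementation: the function names `mutable_positions` and `augment` are hypothetical, the neutral-site mask is supplied by hand rather than derived from PAML output, the secondary structure is assumed to be in dot-bracket notation, and a uniform random substitution stands in for the GFM-based perturbation. The sketch also makes the simplifying assumption that only unpaired neutral sites are mutated, which is one conservative way of leaving base pairs intact.

```python
import random

NUCLEOTIDES = "ACGU"


def mutable_positions(structure, neutral_mask):
    """Return indices that are both neutral and unpaired ('.' in dot-bracket).

    Assumption: `neutral_mask` is a hand-supplied list of booleans; in the
    method described above it would come from phylogenetic analysis.
    """
    if len(structure) != len(neutral_mask):
        raise ValueError("structure and neutral_mask must have equal length")
    return [
        i
        for i, (symbol, is_neutral) in enumerate(zip(structure, neutral_mask))
        if is_neutral and symbol == "."
    ]


def augment(sequence, structure, neutral_mask, rate=0.5, seed=None):
    """Substitute bases at permitted sites only; all other sites are kept.

    Assumption: a random substitution replaces the model-predicted base
    that a genomic foundation model would provide.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have equal length")
    rng = random.Random(seed)
    bases = list(sequence)
    for i in mutable_positions(structure, neutral_mask):
        if rng.random() < rate:
            # Pick a base different from the current one.
            bases[i] = rng.choice([b for b in NUCLEOTIDES if b != bases[i]])
    return "".join(bases)


if __name__ == "__main__":
    seq = "GGGAAAUCCC"
    struct = "(((....)))"  # toy hairpin: three base pairs, four-base loop
    neutral = [False, True, False, True, True, False, True, False, True, False]
    print(augment(seq, struct, neutral, rate=1.0, seed=0))
```

In this toy hairpin, only loop positions flagged as neutral are eligible for substitution, so the stem is left untouched and the dot-bracket structure remains compatible with every augmented sequence. Handling paired neutral sites (for example with compensatory substitutions) is deliberately left out of the sketch.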