Capturing the Mutational Dynamics of SARS-CoV-2 with Graphs

Abstract

The rapid evolution of SARS-CoV-2, driven by lineage diversification and region-specific mutation patterns, presents significant challenges for modeling viral dynamics. Phylogenetic trees are traditionally used for evolutionary inference, but the massive volume of SARS-CoV-2 genomic data, with many similar sequences and few distinguishing mutations, imposes computational and methodological limitations on them. Quasispecies theory instead models viral evolution as a cloud of mutants, motivating a graph-based representation that better captures the complexity of mutational events. Geographic variation adds another layer to this complexity: mutation trends often differ across regions due to local transmission dynamics, host population structures, and selective pressures. In this study, we present the Mutation Learning Graph (MLG), a directed graph framework that organizes SARS-CoV-2 variants by their cumulative mutation profiles relative to the reference genome (NC_045512.2). This structure captures the dynamics of mutation propagation, including fine-grained mutational transitions, and encodes plausible evolutionary relationships among variants. To construct these graphs, we introduce an alignment-aware mutation profiling method and a novel ANCESTOR JOINING algorithm, which incorporates ancestral variants as inferred intermediate nodes to connect observed genomes through biologically coherent mutational paths. We generate MLG datasets for ten geographically and epidemiologically diverse regions and benchmark them on two graph-based tasks: node-level lineage classification and edge-level mutational transition prediction. Using baseline graph neural network architectures (GCN, GraphSAGE, GAT, GGNN, VGAE), we demonstrate how mutation-centric graph structures expose key biological challenges, such as lineage imbalance and location-specific mutation spectra.
For node classification, GraphSAGE and GGNN consistently achieved high accuracy (up to 0.96) and AUROC (up to 0.98). For link prediction, VGAE and GraphSAGE performed best, with AUPRC values of up to 0.96. These results highlight the effectiveness of MLG for capturing biologically meaningful mutation patterns and underscore the importance of localized, mutation-aware modeling for predicting viral mutations and future variant emergence.
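
To make the idea of a directed graph over cumulative mutation profiles concrete, the following is a minimal sketch, not the authors' ANCESTOR JOINING algorithm, whose details the abstract does not give. It assumes each variant is represented as a set of mutations relative to the reference, adds a hypothetical inferred ancestor for each pair of variants whose shared mutations are not already a node, and draws an edge from one profile to another when the first is a strict subset of the second with no other profile in between. The function name `build_mutation_graph`, the node labels, and the toy profiles are all illustrative assumptions.

```python
from itertools import combinations


def build_mutation_graph(profiles):
    """Build a toy directed graph over cumulative mutation profiles.

    profiles: dict mapping a variant name to an iterable of mutation labels
    (relative to the reference). Returns (nodes, edges), where nodes maps
    each name to a frozenset of mutations and edges is a set of (parent, child).
    This is an illustrative simplification, not the published method.
    """
    # The reference genome carries no mutations relative to itself.
    nodes = {"reference": frozenset()}
    nodes.update({name: frozenset(muts) for name, muts in profiles.items()})

    # Assumption: an inferred ancestor is the set of mutations shared by a
    # pair of observed variants, added only if no node has that profile yet.
    seen = set(nodes.values())
    for (a, ma), (b, mb) in combinations(list(nodes.items()), 2):
        shared = ma & mb
        if shared not in seen:
            seen.add(shared)
            nodes[f"anc({a},{b})"] = shared

    # Edge u -> v when u's profile is a strict subset of v's and no other
    # profile lies strictly between them (a covering relation).
    edges = set()
    for u, mu in nodes.items():
        for v, mv in nodes.items():
            if mu < mv and not any(mu < mw < mv for mw in nodes.values()):
                edges.add((u, v))
    return nodes, edges


# Toy profiles; the mutation labels are used purely for illustration.
example = {
    "v1": {"A23403G", "C241T"},
    "v2": {"A23403G", "C241T", "G28881A"},
    "v3": {"A23403G", "C3037T"},
}
nodes, edges = build_mutation_graph(example)
for parent, child in sorted(edges):
    gained = sorted(nodes[child] - nodes[parent])
    print(f"{parent} -> {child} (gains {gained})")
```

In this toy case, v1 and v3 share only one mutation, so the sketch inserts an intermediate node between the reference and both variants. Each edge then corresponds to a set of newly gained mutations, which is the kind of transition an edge-level prediction task would score.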
