A novel Vector-Symbolic Architecture for graph encoding and its application to viral pangenome-based species classification

Fabio Cumbo
Kabir Dhillon
Jayadev Joshi
Davide Chicco
Sercan Aygun
Daniel Blankenberg

Abstract

Viral species classification is crucial for understanding viral evolution and epidemiology, and for developing effective diagnostics and treatments. Traditional methods often rely on sequence similarity, which can be challenging to apply to rapidly evolving viruses. Pangenomes, offering a comprehensive representation of species’ genomic diversity, provide a richer perspective, but their analysis often requires advanced computational methods. We investigate the use of Hyperdimensional Computing (HDC), also known as Vector-Symbolic Architecture (VSA), to encode a multi-species viral pangenome. HDC is an emerging computing paradigm that relies on vectors in high-dimensional spaces.
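As a rough illustration of the HDC primitives such a method builds on, the sketch below shows binding, bundling, and similarity on random bipolar hypervectors. It is a minimal example under stated assumptions, not the authors' implementation: the dimensionality, the bipolar encoding, and all function names are illustrative choices that do not come from the abstract.

```python
import random

DIM = 10_000  # hypervector dimensionality (illustrative assumption)


def random_hv(rng, dim=DIM):
    """Draw a random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(dim)]


def bind(a, b):
    """Bind two hypervectors by element-wise multiplication."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundle hypervectors: element-wise sum, then sign (ties map to +1)."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity between two hypervectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm


rng = random.Random(0)
a, b, c = (random_hv(rng) for _ in range(3))

bundled = bundle([a, b, c])  # stays similar to each of its members
bound = bind(a, b)           # dissimilar to both of its inputs

sim_member = cosine(bundled, a)               # clearly positive
sim_random = cosine(bundled, random_hv(rng))  # close to zero
sim_bound = cosine(bound, a)                  # close to zero
recovered = bind(bound, b)                    # binding is its own inverse here
```

Bundling keeps a vector similar to its members, while binding produces a vector unrelated to its inputs that can still be unbound exactly. These two properties are what allow many structured items to be superimposed in a single vector.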

We develop a new method for encoding graph-structured viral pangenomes using high-dimensional vectors. Pangenomes are represented as weighted de Bruijn graphs constructed using sequences of consecutive k-mers from the genomes, while information about the genome species (their class) is encoded as specific weights on the edges of the graph. The weighted de Bruijn graph representation is encoded into a single high-dimensional vector. We tested three classification strategies: a flat model at the species level, a flat model at the genus level, and a two-step hierarchical model.
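One way to picture this pipeline is the toy sketch below. It is only an illustration under several assumptions and should not be read as the authors' encoding: the class information is attached to each edge by binding a class hypervector (the abstract describes it as an edge weight), the graph vector is an unbinarized sum of edge vectors, and the sequences, the labels `species_A` and `species_B`, and every name in the code are hypothetical.

```python
import random

DIM = 4096  # hypervector dimensionality (illustrative assumption)
K = 4       # k-mer length (illustrative assumption)


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def permute(v):
    """Cyclic shift by one position; marks the target node of a directed edge."""
    return v[-1:] + v[:-1]


def bind(*vectors):
    """Element-wise product of any number of hypervectors."""
    out = [1] * DIM
    for v in vectors:
        out = [x * y for x, y in zip(out, v)]
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def kmer_edges(seq, k=K):
    """De Bruijn edges: pairs of consecutive k-mers overlapping by k-1 bases."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return set(zip(kmers, kmers[1:]))


class GraphEncoder:
    def __init__(self, seed=0):
        self.rng = random.Random(seed)
        self.memory = {}  # item memory: one random hypervector per symbol

    def hv(self, key):
        if key not in self.memory:
            self.memory[key] = random_hv(self.rng)
        return self.memory[key]

    def edge_hv(self, u, v, label):
        # Assumption: the class is bound into the edge as a hypervector.
        return bind(self.hv(("kmer", u)),
                    permute(self.hv(("kmer", v))),
                    self.hv(("class", label)))

    def encode(self, labelled_genomes):
        """Superimpose all class-labelled edges into one graph vector."""
        edges = sorted({(u, v, label)
                        for label, seq in labelled_genomes
                        for u, v in kmer_edges(seq)})
        graph = [0] * DIM
        for u, v, label in edges:
            graph = [g + e for g, e in zip(graph, self.edge_hv(u, v, label))]
        return graph

    def classify(self, graph, seq, labels):
        """Pick the label whose labelled query edges best match the graph."""
        scores = {
            label: sum(dot(graph, self.edge_hv(u, v, label))
                       for u, v in sorted(kmer_edges(seq)))
            for label in labels
        }
        return max(scores, key=scores.get)


# Made-up toy sequences, not real viral genomes.
SEQ_A = "ATGCGTACGTTAGCCTAGGATCCGATTACA"
SEQ_B = "TTGACCGGTAAGCTTCCGGAATTCGCATGG"

encoder = GraphEncoder()
graph_hv = encoder.encode([("species_A", SEQ_A), ("species_B", SEQ_B)])
predicted = encoder.classify(graph_hv, SEQ_A[5:25], ["species_A", "species_B"])
```

In this sketch a query fragment is scored against each candidate class by testing how strongly its class-labelled edges are present in the single graph vector. Edges that were stored under a class contribute a large dot product, while absent edges contribute only noise.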

We applied our method to a pangenome comprising 542 viral species from NCBI GenBank. Our results reveal a complex relationship between model architecture and classification accuracy. The flat species-level model achieved the highest accuracy, correctly classifying 87.08% of test genomes. Counter-intuitively, simplifying the problem to the genus level or using a hierarchical approach degraded performance, with accuracies dropping to 60.51% and 33.57%, respectively. These outcomes highlight critical challenges in alignment-free classification, such as signal dilution in overly broad taxonomic groups and error propagation in multi-step models. The model’s reconstruction rate proved to be a reliable measure of confidence, rather than a direct predictor of correctness.

This novel approach offers a promising new direction for viral classification, not only for its predictive power but also for its ability to reveal underlying challenges in genomic taxonomy.

Version published to bioRxiv (DOI 10.1101/2025.09.08.674958) on Sep 10, 2025