Retrieval-Based AI Framework for Viral Genomic Analysis

Abstract

The rapid growth of genomic sequencing demands fast, accurate, and scalable analysis methods. In applications such as viral lineage assignment and antimicrobial resistance surveillance, constantly emerging variants make supervised models expensive to retrain and tie them to fixed label sets, which motivates retrieval-based genomic classification as a simpler, more flexible alternative. We propose retrieval-based genomic sequence classification as a new task and study it alongside standard supervised approaches on three problems: Hepatitis C virus (HCV) genotyping, COVID-19 discrimination, and Human papillomavirus (HPV) genotyping. We compare standard sequence encodings (one-hot, k-mers, FCGR) with dense embeddings (dna2vec, DNABERT). For each representation, we evaluate supervised classifiers (Random Forest, Decision Tree, XGBoost) and retrieval-based classification, in which sequence vectors are indexed with FAISS and labels are assigned via similarity-weighted k-NN. We also benchmark multiple FAISS index types (Flat, IVF, HNSW, IVFPQ, OPQ) to characterize accuracy–speed–memory trade-offs at scale. Our results show that supervised XGBoost and retrieval over Flat/IVF indexes often achieve excellent accuracy, with different compute and memory profiles. Compressed indexes (IVFPQ, OPQ) provide substantial memory savings with moderate accuracy loss. Across tasks, XGBoost offers the best accuracy–size trade-off, while retrieval-based classification remains competitive with minimal training and flexible index updates. Our unified benchmark and encoder-agnostic pipeline provide practical guidance on when dense retrieval can match or replace traditional classifiers for scalable genomic sequence analysis.
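
The retrieval-based classification described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes a k-mer count encoding with k = 3, cosine similarity, and an exhaustive scan in pure Python standing in for an exact (Flat-style) FAISS index. The function names, the toy sequences, and the labels "A", "B", and "C" are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import product
from math import sqrt


def kmer_vector(seq, k=3):
    """Encode a DNA string as an L2-normalised k-mer count vector over ACGT."""
    vocab = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    vec = [float(counts.get(kmer, 0)) for kmer in vocab]
    norm = sqrt(sum(v * v for v in vec)) or 1.0  # avoid division by zero
    return [v / norm for v in vec]


def knn_classify(query_vec, index_vecs, index_labels, k=3):
    """Similarity-weighted k-NN: each of the k nearest references votes
    for its label with a weight equal to its cosine similarity."""
    # Vectors are unit length, so the dot product is the cosine similarity.
    sims = [sum(a * b for a, b in zip(query_vec, ref)) for ref in index_vecs]
    nearest = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)[:k]
    votes = {}
    for i in nearest:
        votes[index_labels[i]] = votes.get(index_labels[i], 0.0) + sims[i]
    return max(votes, key=votes.get)


# Toy reference set (hypothetical sequences and labels).
references = [
    ("ACACACACACAC", "A"),
    ("AACCAACCAACC", "A"),
    ("GTGTGTGTGTGT", "B"),
    ("GGTTGGTTGGTT", "B"),
]
index_vecs = [kmer_vector(seq) for seq, _ in references]
index_labels = [label for _, label in references]

pred_a = knn_classify(kmer_vector("ACACAACCACAC"), index_vecs, index_labels)
pred_b = knn_classify(kmer_vector("GTGTGGTTGTGT"), index_vecs, index_labels)

# A new class is supported by appending to the index; nothing is retrained.
index_vecs.append(kmer_vector("TATATATATATA"))
index_labels.append("C")
pred_c = knn_classify(kmer_vector("TATATATTATAT"), index_vecs, index_labels)

print(pred_a, pred_b, pred_c)
```

The last step reflects the flexibility argument: a supervised classifier would need retraining to recognise the new label, whereas here the reference set is simply extended. At scale, the exhaustive scan would be replaced by an approximate or compressed index, trading some accuracy for speed and memory.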
