Metagenomic classification of ancient viruses


Abstract

Ancient DNA (aDNA) sequences present unique challenges for taxonomic classification due to extreme fragmentation (reads of 20-100 bp), end-biased cytosine deamination, and high contamination rates. Conventional metagenomic classifiers based on exact k-mer matching or alignment lose discriminative power on such short and damaged reads, limiting the analysis of paleogenomic samples. We present FALCON2, a compression-based metagenomic classifier that leverages position-aware finite-context models to maintain high accuracy on degraded ancient viral sequences. FALCON2 consolidates the capabilities of its predecessor, FALCON-meta, into a unified executable with enhanced features including model persistence, direct processing of compressed inputs, multiple file handling, and optional pre-filtering for contaminated samples. Under controlled benchmarking with database, taxonomy, and thread parity on simulated viral datasets, FALCON2 achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.999, an area under the precision-recall curve (AUPRC) of 0.968, and an F1-score of 0.918, substantially outperforming Centrifuge (AUPRC = 0.625), Kraken2 (AUPRC = 0.184), and CLARK-S (AUPRC = 0.013) on pooled micro-averaged metrics. FALCON2's advantage is most pronounced on ultra-short reads (20-40 bp), where exact k-mers become sparse. Pre-filtering at a threshold of 0.7 improved precision by 10 percentage points with negligible recall loss. FALCON2 runs on systems with 4-8 GB of RAM for typical analyses and is freely available at https://github.com/cobilab/FALCON2 under the GPL v3 license.
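
To give a rough idea of what compression-based scoring with a finite-context model can look like, the following is a minimal Python sketch. It is not FALCON2's implementation: it uses a single plain order-k model with no position awareness or damage modelling, and every name in it (`train_fcm`, `bits_per_base`, the smoothing parameter `alpha`, the toy sequences) is a hypothetical choice made for illustration. The model counts which base follows each k-base context in a reference, then estimates how many bits per base are needed to encode a read; a lower cost suggests the read resembles the reference.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"


def train_fcm(reference, k):
    """Count which base follows each length-k context in the reference."""
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(k, len(reference)):
        counts[reference[i - k:i]][reference[i]] += 1
    return counts


def bits_per_base(read, counts, k, alpha=1.0):
    """Average cost in bits of encoding the read with the trained model.

    Probabilities are additively smoothed with alpha (an illustrative
    choice), so an unseen context falls back to a uniform 2 bits per base.
    """
    total, n = 0.0, 0
    for i in range(k, len(read)):
        ctx, sym = read[i - k:i], read[i]
        seen = counts.get(ctx, {})
        ctx_total = sum(seen.values())
        p = (seen.get(sym, 0) + alpha) / (ctx_total + alpha * len(ALPHABET))
        total += -math.log2(p)
        n += 1
    return total / n if n else float("nan")


# Toy data (hypothetical): a short reference and three 20-30 bp "reads".
K = 4
reference = "ACGTTGCATGCCGATAGCTAGGCTTACGGATCCATGACTGAAGTCCGTAGACCTTGA"
model = train_fcm(reference, K)

exact_read = reference[10:40]          # fragment copied from the reference
damaged_read = "T" + exact_read[1:]    # same fragment with a 5' C-to-T change
unrelated_read = "A" * 20              # shares no 4-base context with the reference

for name, read in [("exact", exact_read),
                   ("damaged", damaged_read),
                   ("unrelated", unrelated_read)]:
    print(name, round(bits_per_base(read, model, K), 3))
```

Because the score is accumulated symbol by symbol from context statistics, a single damaged base only raises the cost of the few positions whose context contains it, whereas an exact k-mer lookup fails outright for every k-mer that spans the change.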
