Expanding the RNA Virus Universe by Scalable Structure-Guided Discovery
Abstract
The discovery of RNA viruses from metatranscriptomic data remains challenging due to their extreme sequence divergence and frequent lack of conserved motifs. We present Rider, a lightweight two-stage framework that couples fast, structure-informed sequence screening with targeted structural validation. Stage 1 uses a compact 35M-parameter protein language model to prioritize RdRp-like fragments at whole-sample scale, achieving over 44× higher end-to-end screening throughput on commodity hardware. Stage 2 applies structure prediction and Foldseek-based alignment against a dedicated RdRp structure resource (∼200k ESMFold-predicted structures), providing orthogonal evidence for remote homologs. Applied to >10,000 metatranscriptomes spanning marine, freshwater, soil and host-associated microbiomes, Rider matches or outperforms leading tools (e.g., LucaProt, PalmScan) and additionally recovers divergent and truncated sequences. Multiple orthogonal indicators, including structure consistency and low DNA read mapping to corresponding contigs, support genuine RNA origin. In a human IBD cohort, Rider agrees with state-of-the-art calls for clinically relevant RNA viruses while extending discovery to divergent lineages. Rider turns structure-guided homology search into a practical, scalable pipeline for RNA virome discovery.
Highlights
A two-stage framework enables structure-guided RNA virus discovery at sample scale, achieving up to 44-fold higher throughput on standard computing hardware.
The method matches or surpasses LucaProt and PalmScan across >10,000 metatranscriptomes from diverse environments, while recovering RdRp fragments missed by existing tools.
Structural validation using ∼200,000 ESMFold-predicted RdRp models and Foldseek alignment supports the detection of remote homologs with high confidence.
Orthogonal evidence, including low DNA read mapping, strand-specific expression, and ORF metrics, confirms RNA origin and reduces false positives.
Open-source code and an openly released RdRp structure database enable scalable, reproducible RNA virome discovery in environmental and clinical settings.
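The two-stage logic described above can be illustrated with a minimal sketch. This is not Rider's actual code or API: the `Candidate` fields, the `two_stage_screen` function, the cutoffs, and the toy scores are all hypothetical, chosen only to show how a cheap sequence-level score can gate a more expensive structural check, with low DNA read mapping used as a final supporting filter.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Candidate:
    seq_id: str
    stage1_score: float       # hypothetical sequence-model score in [0, 1]
    dna_read_fraction: float  # hypothetical share of mapped reads from DNA libraries


def two_stage_screen(
    candidates: Iterable[Candidate],
    structure_score: Callable[[str], float],
    stage1_cutoff: float = 0.5,      # assumed threshold, not from the preprint
    stage2_cutoff: float = 0.5,      # assumed threshold, not from the preprint
    max_dna_fraction: float = 0.05,  # assumed threshold, not from the preprint
) -> List[str]:
    """Return IDs that pass the cheap screen, the structural check, and the DNA filter."""
    accepted = []
    for cand in candidates:
        # Stage 1: fast screen; most candidates are expected to stop here.
        if cand.stage1_score < stage1_cutoff:
            continue
        # Stage 2: costly structural evidence, computed only for survivors.
        if structure_score(cand.seq_id) < stage2_cutoff:
            continue
        # Orthogonal evidence: substantial DNA read mapping argues against RNA origin.
        if cand.dna_read_fraction > max_dna_fraction:
            continue
        accepted.append(cand.seq_id)
    return accepted


# Toy stand-in for a structure-alignment score lookup (values are made up).
toy_structure_scores = {"contig_a": 0.9, "contig_b": 0.2, "contig_c": 0.8}
calls = []


def toy_structure_score(seq_id: str) -> float:
    calls.append(seq_id)  # record which candidates reached stage 2
    return toy_structure_scores[seq_id]


candidates = [
    Candidate("contig_a", 0.95, 0.00),  # passes every check
    Candidate("contig_b", 0.80, 0.00),  # fails the structural check
    Candidate("contig_c", 0.70, 0.40),  # rejected for high DNA read mapping
    Candidate("contig_d", 0.10, 0.00),  # stops at stage 1, never scored structurally
]
hits = two_stage_screen(candidates, toy_structure_score)
print(hits)
print(calls)
```

In this toy run only `contig_a` is retained, and `contig_d` never reaches the structural scorer. That gating is where a cascade of this kind saves compute: the expensive step runs on a small prioritized subset instead of the whole sample.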