Benchmarking siRNA Prediction: The Role of Representation and Validation Strategies
Abstract
Small interfering RNAs (siRNAs) offer transformative potential for targeted therapeutics, yet the design of highly effective, non-toxic candidates is hindered by the risk of off-target effects and RNA instability. A critical flaw in existing in silico prediction models is pervasive data leakage in cross-validation protocols, which artificially inflates performance metrics and produces untrustworthy results. To address this, we developed a rigorous framework that eliminates data leakage through strict cross-validation, leverages z-curves (3D representations of RNA physico-chemical properties) for context-aware sequence encoding, and identifies the sequence regions most critical for efficacy. Our model achieves an AUC of 0.845 under leakage-free validation, surpassing prior work while running 380x faster, which indicates that a superior representation matters more than model complexity. Crucially, we show how experimental variability and cross-validation choices directly affect model reliability, establishing the first benchmarked methods for robust siRNA efficacy prediction. This work provides a foundation for trustworthy sequence design and validation in RNA therapeutics.
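The two technical ideas in the abstract, z-curve encoding and leakage-free cross-validation, can be illustrated with a minimal sketch. It rests on assumptions not stated in the abstract: the encoding uses the standard cumulative Z-curve definition (purine/pyrimidine, amino/keto, weak/strong hydrogen bonding), which may differ from the variant used in the preprint, and leakage is assumed to arise when siRNAs from the same group (for example the same target gene) appear in both training and test folds. The function names `z_curve` and `grouped_folds` are hypothetical.

```python
from collections import defaultdict

# Per-base steps along the three Z-curve axes (standard definition, assumed here):
#   x: purine (A, G) = +1, pyrimidine (C, U) = -1
#   y: amino  (A, C) = +1, keto       (G, U) = -1
#   z: weak H-bond (A, U) = +1, strong H-bond (G, C) = -1
STEPS = {
    "A": (1, 1, 1),
    "G": (1, -1, -1),
    "C": (-1, 1, -1),
    "U": (-1, -1, 1),
}


def z_curve(seq):
    """Return cumulative (x, y, z) coordinates, one triple per position.

    DNA-style input is accepted: T is treated as U.
    """
    x = y = z = 0
    coords = []
    for base in seq.upper().replace("T", "U"):
        dx, dy, dz = STEPS[base]
        x, y, z = x + dx, y + dy, z + dz
        coords.append((x, y, z))
    return coords


def grouped_folds(groups, k):
    """Split sample indices into k folds so that no group spans two folds.

    `groups` holds one label per sample (assumed here: the target gene).
    Groups are placed largest first into whichever fold is currently smallest.
    """
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        smallest = min(range(k), key=lambda i: len(folds[i]))
        folds[smallest].extend(members[label])
    return folds


if __name__ == "__main__":
    print(z_curve("AUGC"))
    genes = ["geneA", "geneA", "geneB", "geneC", "geneC", "geneC"]
    print(grouped_folds(genes, k=2))
```

Each fold then serves once as the held-out test set, so every siRNA from a given group is evaluated only by a model that never saw that group during training. A random per-sample split would not guarantee this.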