AI methods and biologically informed data curation enable accurate RNA m 5 C prediction
Discuss this preprint
Start a discussion What are Sciety discussions?Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
5-methylcytosine (m 5 C) RNA modifications influence nearly every aspect of RNA metabolism, but their transcriptome wide detection is limited by costly, error-prone assays. To bridge this experimental gap, a wave of AI tools now predicts putative m 5 C sites in silico . However, most existing approaches prioritize architectural complexity while neglecting data quality, so their reported gains mainly reflect the artifacts inherited from noisy datasets. We inverted this paradigm by constructing a high-confidence, methyltransferase-specific catalog of m 5 C sites, removing artifacts that confound existing resources. Using this curated corpus, we trained (for the first time in a multiclass setting) three different models (Bi-GRU, CNN, Transformer) to distinguish writer-specific m 5 C sites from unmethylated cytosines. All AI models converged to similar, nearly optimal, performance (AUPRC > 0.97), and a biologically informed analysis revealed that most errors clustered in unmethylated sites mimicking true positives. By augmenting the training set with these hard-to-predict negatives, mined from millions of unmodified cytosines, the models were forced to exploit more nuanced features such as RNA secondary structure and subtle sequence cues, which sharply reduced transcriptome-wide false positive predictions, and predicted methylated transcripts exhibited strong concordance with known methyltransferase biology. Explainable AI techniques also showed that our AI models effectively capture how sequence mutations disrupt m 5 C sites, underscoring their potential to prioritize disease-relevant variants. The main findings of our study underscore that AI models can be decisive levers for reliable m 5 C identification only if fed with curated data and validated through biologically informed computational analysis.