AI methods and biologically informed data curation enable accurate RNA m 5 C prediction

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

5-methylcytosine (m 5 C) RNA modifications influence nearly every aspect of RNA metabolism, but their transcriptome wide detection is limited by costly, error-prone assays. To bridge this experimental gap, a wave of AI tools now predicts putative m 5 C sites in silico . However, most existing approaches prioritize architectural complexity while neglecting data quality, so their reported gains mainly reflect the artifacts inherited from noisy datasets. We inverted this paradigm by constructing a high-confidence, methyltransferase-specific catalog of m 5 C sites, removing artifacts that confound existing resources. Using this curated corpus, we trained (for the first time in a multiclass setting) three different models (Bi-GRU, CNN, Transformer) to distinguish writer-specific m 5 C sites from unmethylated cytosines. All AI models converged to similar, nearly optimal, performance (AUPRC > 0.97), and a biologically informed analysis revealed that most errors clustered in unmethylated sites mimicking true positives. By augmenting the training set with these hard-to-predict negatives, mined from millions of unmodified cytosines, the models were forced to exploit more nuanced features such as RNA secondary structure and subtle sequence cues, which sharply reduced transcriptome-wide false positive predictions, and predicted methylated transcripts exhibited strong concordance with known methyltransferase biology. Explainable AI techniques also showed that our AI models effectively capture how sequence mutations disrupt m 5 C sites, underscoring their potential to prioritize disease-relevant variants. The main findings of our study underscore that AI models can be decisive levers for reliable m 5 C identification only if fed with curated data and validated through biologically informed computational analysis.

Article activity feed