EvoRMD: Integrating Biological Context and Evolutionary RNA Language Models for Interpretable Prediction of RNA Modifications
Abstract
RNA modifications are essential regulators of post-transcriptional gene expression, influencing RNA stability, localization, translation, and degradation. Determining the specific modification at a given nucleotide is therefore critical for understanding its regulatory role. Most computational approaches treat each modification type as an independent binary task. This strategy provides a macro-level statistical perspective, but it does not reflect that, under a defined biochemical or cellular condition, only one modification type can occur at a specific site. Current mapping assays also report a single observed modification per site, leaving all other types unlabeled rather than truly negative. These properties motivate a framework that can reason over competing modification types. We introduce EvoRMD, a unified model for biologically contextualized and interpretable prediction of RNA modification types. EvoRMD integrates contextual sequence embeddings from a large-scale RNA language model with structured biological metadata, including species, organ, cell type, and subcellular localization. A lightweight attention mechanism highlights informative sequence positions. A shared multi-class classifier then generates a context-conditioned plausibility distribution over eleven modification types (Am, Cm, Um, Gm, D, pseudouridine, m1A, m5C, m5U, m6A, m7G), consistent with the single-positive, multiple-unlabeled nature of existing datasets. Although trained in a multi-class setting, EvoRMD also produces calibrated multi-label predictions through sigmoid-transformed logits, enabling direct comparison with existing single-modification and multi-label methods. EvoRMD achieves strong predictive performance and offers interpretable insights through attention profiles and motif analyses. Together, these components establish a biologically grounded framework for identifying and prioritizing RNA modification types from sequence and context.
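The two readouts described in the abstract, a competing plausibility distribution and independent per-type scores, can be illustrated with a minimal sketch. It assumes both are computed from the same vector of eleven logits; the logit values, function names, and variable names below are hypothetical and are not taken from EvoRMD. Any additional calibration step the authors may apply is not shown.

```python
import math

# Modification types listed in the abstract.
MOD_TYPES = ["Am", "Cm", "Um", "Gm", "D", "pseudouridine",
             "m1A", "m5C", "m5U", "m6A", "m7G"]

def softmax(logits):
    """Multi-class view: scores compete and sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(logits):
    """Multi-label view: each logit is squashed independently."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]

# Hypothetical logits for one site (illustrative values, not model output).
logits = [-2.0, -1.5, -3.0, -2.5, -4.0, 0.5, -1.0, -0.5, -3.5, 3.0, -2.0]

plausibility = softmax(logits)   # distribution over the eleven types
per_type = sigmoid(logits)       # independent score per type

# Highest-plausibility type under the multi-class view.
top = MOD_TYPES[plausibility.index(max(plausibility))]
print(top, round(sum(plausibility), 6))
```

The softmax output ranks the types against each other for a given site and context, while the sigmoid scores can each be thresholded separately, which is what allows a comparison against single-modification and multi-label predictors.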