SOORENA: Self-lOOp containing or autoREgulatory Nodes in biological network Analysis
Abstract
Autoregulatory mechanisms, in which proteins modify their own activity or expression, are fundamental components of biological regulatory systems but remain challenging to identify systematically within the scientific literature. Manual curation cannot keep pace with publication growth, and self-regulation is often described only implicitly. To address the lack of automated tools for identifying protein autoregulatory mechanisms, we present SOORENA, a two-stage transformer-based model designed to predict and classify such mechanisms within PubMed abstracts. In Stage 1, the model determines whether a publication describes any form of protein autoregulation. In Stage 2, positive instances are further classified into one of seven mechanistic categories: autophosphorylation, autoubiquitination, autocatalytic activity, autoinhibition, autolysis, autoinducer production, and autoregulation. SOORENA was fine-tuned from PubMedBERT using a curated dataset of 1,332 experimentally validated abstracts sourced from UniProt-referenced publications. On a held-out test set, Stage 1 achieved an accuracy of 96.0% and a precision of 97.8%, limiting the propagation of false positives into Stage 2. Stage 2 performed robustly across all classes, with an overall accuracy of 95.5% and a macro-F1 score of 96.2%, including perfect classification for the two least-represented categories. Error analysis revealed that most misclassifications occurred between mechanistically related categories, suggesting that the model’s learned representations reflect underlying biological relationships. We deployed SOORENA as a Shiny app that enables interactive search, metadata-based filtering, and ranking of predictions by model confidence, alongside standardized ontology definitions, to support scientific exploration. These results demonstrate that domain-specific language models can scale the discovery and curation of biologically critical self-regulatory mechanisms.
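The two-stage design described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `Prediction`, `classify`, `toy_stage1`, `toy_stage2`, and the 0.5 decision threshold are all hypothetical, and the keyword-based stand-ins merely take the place of the fine-tuned PubMedBERT classifiers so that the control flow can be run without any model weights. Only the seven category names come from the abstract.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple

# The seven Stage 2 categories listed in the abstract.
MECHANISMS = (
    "autophosphorylation",
    "autoubiquitination",
    "autocatalytic activity",
    "autoinhibition",
    "autolysis",
    "autoinducer production",
    "autoregulation",
)


@dataclass
class Prediction:
    """Result of the cascade (hypothetical container, not from the paper)."""
    is_autoregulatory: bool
    stage1_confidence: float
    mechanism: Optional[str] = None
    stage2_confidence: Optional[float] = None


# Stage 1 maps an abstract to P(describes autoregulation).
Stage1 = Callable[[str], float]
# Stage 2 maps an abstract to (mechanism label, confidence).
Stage2 = Callable[[str], Tuple[str, float]]


def classify(text: str, stage1: Stage1, stage2: Stage2,
             threshold: float = 0.5) -> Prediction:
    """Run Stage 2 only on abstracts that Stage 1 flags as positive."""
    p_positive = stage1(text)
    if p_positive < threshold:  # assumed threshold, not reported in the abstract
        return Prediction(False, p_positive)
    mechanism, confidence = stage2(text)
    if mechanism not in MECHANISMS:
        raise ValueError(f"unknown mechanism label: {mechanism!r}")
    return Prediction(True, p_positive, mechanism, confidence)


# Toy stand-ins for the fine-tuned models: simple keyword matching.
def toy_stage1(text: str) -> float:
    lowered = text.lower()
    return 0.9 if any(m in lowered for m in MECHANISMS) else 0.1


def toy_stage2(text: str) -> Tuple[str, float]:
    lowered = text.lower()
    for mechanism in MECHANISMS[:-1]:  # try the specific categories first
        if mechanism in lowered:
            return mechanism, 0.9
    return "autoregulation", 0.6  # fall back to the general category


if __name__ == "__main__":
    examples = [
        "The kinase undergoes autophosphorylation on tyrosine residues.",
        "We measured membrane lipid composition in yeast.",
    ]
    for abstract in examples:
        print(classify(abstract, toy_stage1, toy_stage2))
```

Because Stage 2 sees only what Stage 1 accepts, a high Stage 1 precision keeps false positives from being assigned a mechanism label, which is the motivation the abstract gives for reporting that metric. In a real setting the two callables would wrap the fine-tuned classifiers rather than keyword checks.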