Deep Learning for RNA Secondary Structure Determination: Gauging Generalizability and Broadening the Scope of Traditional Methods
Abstract
The diverse regulatory functions, protein production capacity, and stability of natural and synthetic RNAs are closely tied to their ability to fold into intricate structures. Determining RNA structure is thus fundamental to RNA biology and bioengineering. Among existing approaches to structure determination, computational secondary structure prediction offers a rapid and low-cost strategy and is therefore widely used, especially when seeking to identify functional RNA elements in large transcriptomes or to screen massive libraries of novel designs. While traditional approaches rely on detailed measurements of folding energetics and/or probabilistic modeling of structural data, recent years have witnessed a surge in deep learning methods, inspired by their tremendous success in protein structure prediction. However, the limited diversity and volume of known RNA structures can impede the ability of these methods to accurately predict structures markedly different from those seen during training. This shortfall is known as the generalization gap and currently poses a major barrier to progress in the field. In this Perspective article, we gauge method generalizability using a new benchmark dataset of structured RNAs we curated from the Protein Data Bank. We also discuss the emergence of deep learning methods for predicting structure probing data and use a new dataset to underscore generalization challenges unique to this domain, along with directions for future improvement. Looking beyond predictive accuracy, we review how advances in deep learning have recently enabled scalable and accessible optimization of traditional structure prediction methods and their seamless integration with modern neural networks.