Exploring functional conservation in silico: a new machine learning approach to RNA-editing

Abstract

Around 50 years ago, molecular biology opened the path to understanding changes in form, adaptation, and complexity, as well as the basis of human diseases, through myriad reports on gene birth, gene duplication, gene expression regulation, and splicing regulation, among other relevant mechanisms behind gene function. Here, with the advent of big data and artificial intelligence (AI), we focus on an elusive and intriguing mechanism of gene function regulation, RNA editing, in which a single nucleotide of an RNA molecule is changed, with a remarkable impact on the complexity of the transcriptome and proteome. We present a new-generation approach to assess the functional conservation of the RNA-editing targeting mechanism using two AI learning algorithms: random forest (RF) and bidirectional long short-term memory (biLSTM) neural networks with an attention layer. These algorithms, combined with RNA-editing data from databases and variant calling from same-individual RNA-seq and DNA-seq experiments in different species, allowed us to predict RNA-editing events using both primary sequence and secondary structure. We then devised a method for assessing conservation or divergence in the molecular mechanisms of editing completely in silico: the cross-training analysis. This novel method not only helps to understand the conservation of the editing mechanism through evolution but could also set the basis for understanding how it is involved in several human diseases.

Article activity feed

  1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.

    Learn more at Review Commons


    Reply to the reviewers

    We thank the reviewers for their valuable comments, which we have followed to greatly improve our manuscript.

    REVIEWER 1

    Major Comments:

    While the evidence presented supports the application of machine learning in predicting RNA editing events, the paper falls short in justifying its significance within the scope of RNA editing in non-coding regions and Alu repeats, which are typically characterized by low conservation. The paper should provide a more compelling rationale for the method's necessity and potential uses.

    While it is true that the databases used for mouse and human, as well as the procedures used to obtain the mackerel RNA-editing data, are rich in Alu repeats and non-coding regions, these are not our focus. We gathered all the available A-to-I editing sites and fed them to our algorithms without distinction. In addition, we are not yet looking for conservation of the sites themselves, but for conservation of the mechanism. We assess this by testing the ability of an algorithm trained on one species to predict the editing sites in a different species, i.e. cross-training. We already state this in the introduction, but we have added an extra sentence to its last paragraph.
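The cross-training idea described in this response can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy model (which only looks at the nucleotide immediately upstream of the central adenosine), the species names, and the tiny datasets are hypothetical stand-ins, not the authors' RF or biLSTM models or their data. The point is only the shape of the analysis: train on one species, evaluate on every other, and read the off-diagonal accuracies.

```python
from collections import defaultdict


def train_toy_model(samples):
    """Learn, for each upstream-neighbour nucleotide, whether most windows are edited.

    `samples` is a list of (window, label) pairs, where the candidate adenosine
    sits at the centre of `window` and `label` is 1 (edited) or 0 (unedited).
    This is a deliberately trivial stand-in for a real classifier.
    """
    counts = defaultdict(lambda: [0, 0])  # nucleotide -> [unedited, edited]
    for window, label in samples:
        neighbour = window[len(window) // 2 - 1]
        counts[neighbour][label] += 1
    return {nt: int(c[1] >= c[0]) for nt, c in counts.items()}


def predict(model, window):
    """Predict 1 (edited) or 0 (unedited); unseen neighbours default to 0."""
    neighbour = window[len(window) // 2 - 1]
    return model.get(neighbour, 0)


def accuracy(model, samples):
    """Fraction of samples whose label the model predicts correctly."""
    correct = sum(predict(model, w) == y for w, y in samples)
    return correct / len(samples)


def cross_training_matrix(datasets):
    """Train one model per species and score it on every species."""
    models = {sp: train_toy_model(s) for sp, s in datasets.items()}
    return {
        (train_sp, test_sp): accuracy(models[train_sp], datasets[test_sp])
        for train_sp in datasets
        for test_sp in datasets
    }


# Hypothetical toy data: 5-nt windows centred on an adenosine.
datasets = {
    "species_A": [("CUAGG", 1), ("GUAGC", 1), ("CGACG", 0), ("AGAUU", 0)],
    "species_B": [("AUAGA", 1), ("UGAAC", 0), ("CUACC", 1), ("GCAUG", 0)],
    "species_C": [("CGAGG", 1), ("AUACC", 0)],  # opposite rule on purpose
}

matrix = cross_training_matrix(datasets)
for (train_sp, test_sp), acc in sorted(matrix.items()):
    print(f"train={train_sp} test={test_sp} accuracy={acc:.2f}")
```

In this toy setting, a model trained on species_A transfers well to species_B and poorly to species_C, which is the kind of contrast a cross-training analysis would read as compatible with conservation or divergence of the targeting rule. A real analysis would also use held-out test sets rather than scoring a model on its own training data.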

    A significant limitation of this study is the lack of a thorough comparison with existing methodologies and traditional statistical approaches. Incorporating such analyses would substantially strengthen the validity of the findings.

    We would like to thank the reviewer for pointing out this limitation. We have updated the manuscript with a new table in the results and a new discussion segment.

    The descriptions of the machine learning algorithms are insufficiently detailed for replication or thorough comparison. A more comprehensive explanation of the algorithms' parameters and configurations is critical.

    While the methods section of the main manuscript is kept short to avoid overburdening the manuscript, there is a full extended methods section with step-by-step instructions to replicate the results, as well as full documentation available on GitHub at https://github.com/cherrera1990/RNA-editing-pred.

    The paper lacks detailed analysis of the prediction accuracy, particularly concerning non-human data and the implications of false positives in unbalanced datasets. A more nuanced interpretation is essential for a comprehensive understanding.

    We have added two discussion segments to address this point. We thank the reviewer for noticing this and helping us to improve our manuscript.
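The reviewer's concern about false positives in unbalanced datasets can be made concrete with a short worked example. This is a minimal sketch with hypothetical counts (they are not taken from the manuscript): when unedited sites vastly outnumber edited ones, a classifier can show high accuracy and recall while most of its positive calls are still false.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute standard metrics from the four cells of a binary confusion matrix."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }


# Hypothetical unbalanced test set: 100 edited sites versus 9,900 unedited sites.
metrics = classification_metrics(tp=90, fp=990, tn=8910, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these assumed counts, accuracy and recall are both 0.9 and the false positive rate is only 0.1, yet precision is about 0.083, so roughly eleven of every twelve predicted editing sites would be false positives. This is why accuracy alone is hard to interpret on unbalanced data.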

    The discussion on the evolutionary conservation of RNA editing needs to more explicitly highlight potential practical applications and future research directions. The current treatment of this topic does not offer clear actionable insights.

    While true, we believe that what the reviewer suggests is not the main scope of the paper. We have added an extra sentence at the end to suggest possible doors this work can open.

    Minor Comments:

    The manuscript is marred by grammatical errors and awkward phrasing, including unnecessary references to historical figures like Charles Darwin. A thorough editing and proofreading process would greatly enhance readability.

    We removed the Charles Darwin reference and proofread the manuscript to correct grammatical errors.

    The justification for the selection of statistical tests is unclear, and a more detailed explanation of their relevance to the study's findings would improve the paper's analytical rigor. Incorporating descriptions of the statistical descriptors directly into the main text would remedy this issue.

    We are not sure what the reviewer means by this point. The descriptors used for the random forest are thoroughly described in the extended methods. Besides the tests used for assessing prediction accuracies, which are listed in the extended methods section as well as on GitHub, we do not use any other statistical analysis. Nonetheless, we have improved the general methods with an extra paragraph for RF and added reminders of the availability of the extended methods.

    REVIEWER 2

    The main problem of this study is its dependence on computationally predicted RNA secondary structures. To date, algorithms for inferring the secondary structures of polynucleotide chains are affected by considerable errors in several cases. Therefore, there is a high probability that at least part of the training data is largely biased. In this sense it would be appropriate to correlate the performance of the model to that of linearfold used to obtain the secondary structure data.

    While this is completely true for the RF algorithm, and is probably the cause of the low accuracy it achieves compared with other methods, it is not the case for the biLSTM algorithm. As we can see in Figure 3 A and Figure 3 B (and Supp. Figure 8 A and 8 B), the accuracy obtained using sequence alone is almost identical to the one obtained using both channels, while the accuracy obtained using just secondary structure is noticeably lower. This most probably means that the biLSTM algorithm is simply ignoring the secondary structure channel, so no bias is being introduced in the training dataset.

    Furthermore, it is known that bi-LSTMs trained on large datasets tend to be affected by catastrophic forgetting, therefore it should be evaluated to what extent the performances can be improved by expanding the dataset.

    While true, this can be dealt with by an attention layer such as the one we use. In addition, we can see (Supp. Figure 5) how the mackerel prediction accuracy decreases when we reduce the database size. This can be marginally observed in human as well.

    There is also a notable inconsistency between the performance summary table and the confusion matrices to which it refers.

    We have corrected Figure 6 to show the proper percentages (the confusion matrices were correct) and reordered Supp. Figure 3 to more closely match Table 2.
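The kind of mismatch discussed here, between reported percentages and the underlying confusion matrices, can be checked mechanically. The following is a minimal sketch under stated assumptions: the counts and reported values are hypothetical, and normalising by row (true class) is one common convention that may differ from the one used in the manuscript's figures.

```python
def row_percentages(matrix):
    """Convert each row of a confusion matrix of counts into percentages of that row."""
    result = []
    for row in matrix:
        total = sum(row)
        result.append([100.0 * c / total if total else 0.0 for c in row])
    return result


def consistent(matrix, reported, tolerance=0.5):
    """Return True if every reported percentage matches the counts within `tolerance` points."""
    computed = row_percentages(matrix)
    return all(
        abs(c - r) <= tolerance
        for computed_row, reported_row in zip(computed, reported)
        for c, r in zip(computed_row, reported_row)
    )


# Hypothetical counts: rows are true classes (unedited, edited), columns are predictions.
counts = [[80, 20], [30, 70]]

matches = consistent(counts, [[80.0, 20.0], [30.0, 70.0]])
mismatch = consistent(counts, [[70.0, 30.0], [30.0, 70.0]])
print(matches, mismatch)
```

A check like this only flags disagreement; deciding whether the table or the figure holds the correct values still requires going back to the raw predictions.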

    Finally, the 3' enrichment of guanosines, which is typical of the consensus recognized by the ADARs, does not appear to emerge from the sequence logo relating to the training data.

    We did notice this. We already had a brief comment on it in the discussion, and we have now expanded it further.

    Point-by-point description of the revisions

    Figures and Tables

    • Figure 6 has been corrected with the proper accuracies.

    • Supp. Figure 3 has been reordered to mirror the Table 2 design.

    • Table 1 has been renamed to Table 2.

    • A new table has been added as Table 1, listing other analyses of RNA-editing prediction by machine learning.


    Introduction

    • Charles Darwin reference has been removed (L11).

    • "independently of the conservation of editing sites" added to last paragraph (L117).

    Results

    • New section "Benchmarking the algorithms with previous RNA-editing prediction attempts based on machine learning" added including a new table as Table 1 (L170-178).

    Discussion

    • "Random forest" section expanded at the end (L254-258).

    • "biLSTM algorithm" section expanded at the end of paragraph 1 and paragraph 2 (L274-280; L289-295).

    • "Differences in accuracy between human and non-human data" section expanded at the end (L313-316).

    • Additional sentence added at the end of "Cross-training and mechanism conservation" section (L353-355).

    Methods

    • Reminders of availability of extended methods added at the end of "Origin of the RNA-editing and genomic data", "General pipeline for constructing the Random Forest and Neural networks datasets", and "biLSTM" sections (L375; L390; L429-430).

    • Extra paragraph added for "RF" section (L408-413).

    Proofreading and correction of typos

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    This study describes Deep Learning applications aimed at identifying edited sites in different organisms. The method is able, starting from the knowledge of the transcriptome of one organism, to predict RNA editing in another, exploiting the functional conservation of ADAR enzymes throughout the animal kingdom. This study concludes that this approach, within certain limits, is a feasible option and worthy of further development.

    The main problem of this study is its dependence on computationally predicted RNA secondary structures. To date, algorithms for inferring the secondary structures of polynucleotide chains are affected by considerable errors in several cases. Therefore there is a high probability that at least part of the training data is largely biased. In this sense it would be appropriate to correlate the performance of the model to that of linearfold used to obtain the secondary structure data. Furthermore, it is known that bi-LSTMs trained on large datasets tend to be affected by catastrophic forgetting, therefore it should be evaluated to what extent the performances can be improved by expanding the dataset. There is also a notable inconsistency between the performance summary table and the confusion matrices to which it refers. Finally, the 3' enrichment of guanosines, which is typical of the consensus recognized by the ADARs, does not appear to emerge from the sequence logo relating to the training data.

    Significance

    Advance: compare the study to existing published knowledge: does it fill a gap? What kind of advance does it make (conceptual, fundamental, methodological, incremental, ...)?

    The study, despite the critical remarks addressed above, represents a conceptual advancement.

    Audience: which communities will be interested in/influenced, what kind of audience (broad, specialised, clinical, basic research, applied sciences, fields and subfields, ...)?

    This contribution targets a specialised audience, even if the potential applications are broad.

    Describe your expertise

    Comparative Genomics and Bioinformatics

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #1

    Evidence, reproducibility and clarity

    Summary: This manuscript presents an approach for assessing the conservation of RNA editing, with a particular focus on non-coding regions and Alu repeats, using machine learning techniques. The goal is to forecast RNA editing occurrences and their evolutionary conservation across different species. However, the paper does not convincingly argue for the importance or the necessity of this method, especially considering the anticipated low conservation levels in the targeted regions.

    Major Comments:

    1. While the evidence presented supports the application of machine learning in predicting RNA editing events, the paper falls short in justifying its significance within the scope of RNA editing in non-coding regions and Alu repeats, which are typically characterized by low conservation. The paper should provide a more compelling rationale for the method's necessity and potential uses.
    2. A significant limitation of this study is the lack of a thorough comparison with existing methodologies and traditional statistical approaches. Incorporating such analyses would substantially strengthen the validity of the findings.
    3. The descriptions of the machine learning algorithms are insufficiently detailed for replication or thorough comparison. A more comprehensive explanation of the algorithms' parameters and configurations is critical.
    4. The paper lacks detailed analysis of the prediction accuracy, particularly concerning non-human data and the implications of false positives in unbalanced datasets. A more nuanced interpretation is essential for a comprehensive understanding.
    5. The discussion on the evolutionary conservation of RNA editing needs to more explicitly highlight potential practical applications and future research directions. The current treatment of this topic does not offer clear actionable insights.

    Minor Comments:

    1. The manuscript is marred by grammatical errors and awkward phrasing, including unnecessary references to historical figures like Charles Darwin. A thorough editing and proofreading process would greatly enhance readability.
    2. The justification for the selection of statistical tests is unclear, and a more detailed explanation of their relevance to the study's findings would improve the paper's analytical rigor. Incorporating descriptions of the statistical descriptors directly into the main text would remedy this issue.

    Significance

    Summary: The manuscript introduces a method to explore the functional conservation of RNA editing. However, it does not adequately justify its significance or practical applicability, particularly in the context of non-coding regions characterized by low conservation. The lack of comparative analysis with existing methods and detailed machine learning methodology explanations detracts from its potential impact. Addressing these issues would greatly enhance the paper's contribution to the scientific community.

    General Assessment: The cornerstone of this study is its approach towards the prediction and evolutionary conservation analysis of RNA-editing events using machine learning techniques. Despite these technical achievements, the study falls short in adequately highlighting the biological significance of RNA editing within non-coding regions and Alu repeats. Additionally, the absence of a comprehensive comparative analysis with pre-existing methods and the lack of detailed algorithmic descriptions somewhat diminish the study's potential influence and applicability in the wider scientific domain. Moreover, there are grammatical errors and awkward phrasings that disrupt the flow of the text (e.g. why are we talking about Charles Darwin?). Please just focus on the method and RNA editing to improve the overall readability of the paper!

    Advance: The research notably progresses the field of genomics by harnessing machine learning to investigate RNA editing prediction and conservation, a subject not thoroughly examined in existing literature. Its innovative utilization of advanced computational models sets a new precedent, offering fresh perspectives on the mechanisms of RNA editing and their evolutionary contexts. This study enriches our understanding of genomics by illustrating the applicability of machine learning in unraveling the complexities of biological phenomena, such as RNA editing, thereby expanding the frontier of knowledge in both theoretical and practical aspects of genomics research.

    Audience: A niche audience comprising bioinformatics experts focused on RNA editing, computational biology, and evolutionary genetics.

    My proficiency centers on human genomics, RNA editing biology, and computational methodologies.