BrumiR: A toolkit for de novo discovery of microRNAs from sRNA-seq data

Carol Moraga
Evelyn Sanchez
Mariana Galvão Ferrarini
Rodrigo A. Gutierrez
Elena A. Vidal
Marie-France Sagot


Abstract

MicroRNAs (miRNAs) are small non-coding RNAs that are key players in the regulation of gene expression. In the last decade, with the increasing accessibility of high-throughput sequencing technologies, different methods have been developed to identify miRNAs, most of which rely on pre-existing reference genomes. However, when a reference genome is absent or is not of high quality, such identification becomes more difficult. In this context, we developed BrumiR, an algorithm that is able to discover miRNAs directly and exclusively from sRNA-seq data. We benchmarked BrumiR with datasets encompassing animal and plant species using real and simulated sRNA-seq experiments. The results demonstrate that BrumiR reaches the highest recall for miRNA discovery, while at the same time being much faster and more efficient than the state-of-the-art tools evaluated. This efficiency allows BrumiR to analyze a large number of sRNA-seq experiments from plant or animal species. Moreover, BrumiR detects additional information regarding other expressed sequences (sRNAs, isomiRs, etc.), thus maximizing the biological insight gained from sRNA-seq experiments. Finally, when a reference genome is available, BrumiR provides a new mapping tool (BrumiR2ref) that performs an a posteriori exhaustive search to identify the precursor sequences. The code of BrumiR is freely available at https://github.com/camoragaq/BrumiR .

GigaScience
Feb 17, 2023


This work has been peer reviewed in GigaScience ( see https://doi.org/10.1093/gigascience/giac093 ), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer name: Ernesto Picardi

The manuscript by Moraga et al. describes BrumiR, a software tool for the de novo identification of miRNAs from deep sequencing experiments of the low-molecular-weight RNA fraction. In contrast with existing tools, BrumiR is based on de Bruijn graphs generated directly from raw FASTQ reads. Its performance on simulated and real sequencing data, in terms of precision, recall and F-score, is very good. In addition, the tool is ultra-fast, enabling the analysis of huge amounts of data. I tried to use BrumiR but always got a GLIB error. I tested the script on different Linux and Mac computers but was not able to fix the error. It seems that a very recent version of the GLIB library is required. Unfortunately, I was therefore unable to test the program and look at the outputs.

Major concerns:

I was not able to run the program and, thus, to provide a proper review. In my opinion, the GitHub page should address this by stating the minimal software and hardware requirements to run BrumiR. The authors could also include a copy of the output files (by the way, there is a typo in the description of the second output file).

Since the tool is able to identify novel miRNAs and also to look at known ones, the authors could provide an output file including the read count per miRNA. In addition, since the tool is expected to be ultra-fast (not checked, see above), differential expression analysis could also be implemented.

I also suggest implementing a graphical output, such as a summary in a formatted HTML page.

Using BrumiR, the authors analyze miRNAs in Arabidopsis during development, discovering three novel miRNAs. Although bioinformatic evidence indicates that they could be real miRNAs, an experimental validation is required. Indeed, these miRNAs have been detected by BrumiR only. I think that this validation could be easily done because the authors generated the sRNA-seq data themselves. In my opinion, this experiment could really improve the manuscript and confirm the high performance of BrumiR.


Reviewer name: Marc Friedlander

The authors here present BrumiR, a de Bruijn-based method to discover miRNAs independently of a reference genome. Today most miRNA discovery and annotation is done by mapping sequenced RNAs to readily available reference genomes and analyzing the mapping profiles. However, there are some use cases where the genome-free approach is needed (particularly for species that have no reference genome or where the genomes have missing parts); therefore BrumiR could potentially be useful for the community. However, the comparison to existing tools needs to be done in a more careful way.

Major comments:

RFAM filtering is not really part of the prediction step; it is rather a filtering step. Therefore, to make a fair comparison with mirnovo (the other genome-free tool), BrumiR should additionally be run without RFAM filtering, and mirnovo should additionally be run using the exact same RFAM filtering.

It appears that 16-mers from miRBase miRNAs were specifically excluded from the RFAM catalog used for the filtering, which is reasonable. However, the miRNAs from the exact benchmarked species should not be included in the miRBase 16-mer catalog used, to avoid circular reasoning.

The miRDeep2 software should ideally not be run with default options. This is particularly important because the miRDeep2 performance in this manuscript appears lower than what is reported in other studies (e.g. Friedlander et al. 2012). First, reference mature miRNAs from a related and well-annotated species should be included to support the prediction. Second, a score cut-off should be used that gives a decent signal-to-noise ratio according to the miRDeep2 output overview table (for instance 5:1). Third, all read pre-processing and genome mapping should be performed with the mapper.pl script, which is part of the miRDeep2 package.

It appears that only miRNA-derived sequences were included in the simulated data. In fact, real small RNA-seq data typically contains fragments from other known types of RNA and also sequences from unannotated parts of the genome. Therefore, the authors should use simulated data that also includes samples from RFAM and randomly sampled sequences from the reference genome (for instance 10% of each). Overall, the use of simulated sequence data could be given less prominence in this study, since real small RNA-seq data is readily available these days and typically has a structure that is not easy to simulate. Further, there is little reason not to use real data, since the miRNAs in miRBase tend to be reasonably well curated for most species and can therefore function well as a gold standard for benchmarking.

The precision of BrumiR is in some cases lower than 0.2, for instance for one mouse dataset. From this dataset ~3000 mouse miRNAs are reported, the majority of which are not in miRBase and can reasonably be presumed to be false positives. The authors should comment on why this particular dataset appears to produce so many false positives for BrumiR. Could this have to do with the prevalence of piRNAs, which the software cannot easily discern from miRNAs? Also, the authors should reflect on what kinds of use cases could tolerate these thousands of false positives. Would this be for generating candidates for downstream high-throughput validation?

The authors should either benchmark BrumiR against the genome-free methods miReader and MirPlex, or explain why this comparison is not relevant.
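The circularity concern about the miRBase 16-mer catalog raised above can be illustrated with a minimal sketch. It assumes a hypothetical in-memory list of (species, sequence) records, not BrumiR's actual catalog format or code; the function names, the `exclude_species` parameter, and the toy records are all illustrative assumptions.

```python
def kmers(seq, k=16):
    """Return the set of all k-mers in seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_catalog(records, k=16, exclude_species=None):
    """Collect k-mers from (species, sequence) records.

    Records whose species equals exclude_species are skipped, so the
    benchmarked species contributes nothing to the catalog.
    Hypothetical helper for illustration only.
    """
    catalog = set()
    for species, seq in records:
        if species == exclude_species:
            continue
        # Normalise RNA to the DNA alphabet used by sequencing reads.
        catalog |= kmers(seq.upper().replace("U", "T"), k)
    return catalog


# Toy records for illustration only: (species code, mature sequence).
records = [
    ("hsa", "UGAGGUAGUAGGUUGUAUAGUU"),
    ("mmu", "UAGCAGCACGUAAAUAUUGGCG"),
]

full = build_catalog(records)
held_out = build_catalog(records, exclude_species="mmu")
print(len(full), len(held_out))
```

When benchmarking on the species labelled `mmu` here, only the `held_out` catalog would be used, so that known miRNAs of the benchmarked species cannot influence which candidates survive filtering.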

Minor comments:

The brief introduction to miRNA biology should be carefully edited by an expert in the field. Currently, very old reviews are being cited (e.g. Bartel 2004), and some of the other references appear to be somewhat arbitrary (e.g. why focus on plant host-pathogen interactions out of the hundreds of established functions of miRNAs?). The excellent review of Dave Bartel from 2018 contains references to numerous milestone studies that the introduction could build on.

The authors write on page 2 that genome-based methods struggle with a high rate of false positive prediction, citing [9]. However, this is a mis-reference, since reference [9] states that methods that rely on only the genome and do not leverage small RNA-seq data have high false positive rates.


Reviewer name: Dadi Gao

Summary: The authors developed a de novo assembly method, BrumiR, for small RNA sequencing data based on a de Bruijn graph algorithm. This tool displayed a relatively high sensitivity in finding miRNAs and helped the authors discover a novel miRNA in A. thaliana roots.

Major comments:

Have the authors compared the performance with different seed lengths? Even if the minimal miRNA length is 18 nt in miRBase 21, seed=18 might not necessarily lead to the best AUC or F-score (this might also be related to Comment 4).

The authors need to benchmark BrumiR against more existing tools (e.g. ML-based methods), and to include more genome-free methods (e.g. MiRNAgFree).

It would also be interesting to know whether de novo methods for mRNA assembly would be useful on the miRNA side. It would be great if the authors were able to compare the performance of BrumiR2reference (without filtering for RFAM) with Trinity in genome-guided mode, by tweaking its seed length to be the same as BrumiR's.

The tool's sensitivity is promising across animal and plant datasets. However, the average precision is quite low: an average precision of 0.3 means a false discovery rate of 0.7. This is not an acceptable value for a tool designed to discover novel miRNAs. Is there any parameter the authors could tweak towards a better performance? For example, is a seed length of 18 nt too short to start with? Are there any other sequence features the authors should take into account to boost the performance? Or maybe some post-assembly filtering approaches might be sufficient and helpful.

Wet-lab validation (e.g. a luciferase assay) of the identified novel miRNAs would demonstrate the real-life usefulness of BrumiR. This is extremely important, as the tool showed a high false discovery rate.
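The relationship between precision and false discovery rate used in the comment above (FDR = 1 - precision) can be checked with a short sketch. The counts below are hypothetical and chosen only to reproduce a precision of 0.3; they are not taken from the manuscript's benchmarks.

```python
def benchmark_metrics(tp, fp, fn):
    """Compute precision, recall, F-score and false discovery rate.

    tp: predicted miRNAs that are in the gold standard
    fp: predicted miRNAs that are not in the gold standard
    fn: gold-standard miRNAs that were not predicted
    """
    predicted = tp + fp
    known = tp + fn
    precision = tp / predicted if predicted else 0.0
    recall = tp / known if known else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    # FDR is the fraction of predictions that are false: FP / (TP + FP).
    fdr = fp / predicted if predicted else 0.0
    return {"precision": precision, "recall": recall,
            "f_score": f_score, "fdr": fdr}


# Hypothetical counts: 100 predictions, 30 of them correct, 10 known miRNAs missed.
metrics = benchmark_metrics(tp=30, fp=70, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With these assumed counts, 70 of the 100 predictions are false, which is the sense in which a precision of 0.3 corresponds to a false discovery rate of 0.7 even when recall is high.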

Minor comments:

miRNA maturation involves RNA editing. Can the authors comment on how this would be handled and captured by BrumiR? It seems that the authors allow mismatches when clustering the potential miRNAs via the edlib library. It would be interesting to know whether, and to what extent, edlib would help in including RNA-edited candidates in the final result.
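To make the question about edited candidates concrete, the sketch below computes the edit distance between a toy sequence and a copy carrying a single A-to-G substitution (A-to-I editing is read as G by the sequencer). It uses a plain dynamic-programming Levenshtein distance as a stand-in for edlib, and the `max_dist` threshold and `same_cluster` helper are hypothetical; the distance BrumiR actually tolerates when clustering is not stated in the review above.

```python
def edit_distance(a, b):
    """Levenshtein distance via dynamic programming (stand-in for edlib)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def same_cluster(a, b, max_dist=1):
    """Hypothetical rule: two candidates cluster if within max_dist edits."""
    return edit_distance(a, b) <= max_dist


# Toy candidate and an edited copy with one A replaced by G (index 2).
canonical = "TGAGGTAGTAGGTTGTATAGTT"
edited = canonical[:2] + "G" + canonical[3:]

print(edit_distance(canonical, edited), same_cluster(canonical, edited))
```

Under this assumed threshold, a singly edited read would be merged into the cluster of its unedited counterpart, so the edited form would be absorbed rather than reported as a separate candidate; whether that matches BrumiR's behaviour is for the authors to confirm.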

Version published to 10.1101/2020.08.07.240689 on bioRxiv
Aug 7, 2020