Comparative Analysis of De Novo Assemblers and Quantification Software for RNA-sequencing Data in Non-Model Arthropods

Marie V. Brasseur
Florian Leese
Christoph Mayer

Abstract

Background

RNA-sequencing has greatly improved our understanding of the transcriptomic regulation of fundamental biological processes. Although the method has matured significantly within the last decade, bioinformatic processing of the resulting high-dimensional data sets is still challenging, and the performance of algorithms can vary between data sets. As a consequence, for most non-model organisms, in particular arthropods, there is little or no published evidence on which software is best suited to handle taxon-specific data characteristics. Therefore, we evaluated the performance of different de novo transcriptome assemblers (Trinity, rnaSPAdes, IDBA-tran) and transcript quantification software (RSEM, Salmon) on transcriptomic data of a non-model insect and a freshwater crustacean species, as well as the impact of different quality trimming strategies on the downstream bioinformatic processing results.

Results

While the trimming strategy had no considerable effect on the quality of transcriptome assemblies, the choice of the assembler had a substantial impact. IDBA-tran was less sensitive than the two other assemblers and produced the most fragmented transcriptome assemblies. The low remapping rates of reads against IDBA-tran assemblies further suggest that the input read data was not effectively leveraged by this algorithm. In contrast, Trinity and rnaSPAdes both generated comprehensive and contiguous de novo transcriptome assemblies, although Trinity appeared to be slightly more sensitive. This increased sensitivity, however, was associated with a higher redundancy in Trinity-generated assemblies compared to assemblies produced with rnaSPAdes. When the quality of the transcriptome assembly was high, RSEM and Salmon were able to identify the origin of at least 90% of the read data in the reference. Despite their different underlying quantification approaches, the estimated transcript counts of both tools were highly correlated and their expression signal was consistent. Notably, the alignment-free quantification algorithm Salmon was substantially faster than the alignment-based approach of RSEM. Furthermore, it was also slightly more sensitive, increasing the average re-mapping rate to ∼98%.
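The two summary statistics mentioned above, the re-mapping rate and the correlation between count estimates from two quantification tools, can be illustrated with a minimal sketch. This is not the authors' analysis code: the function names, the log2(count + 1) transformation, the choice of Pearson correlation, and the toy count values are all assumptions made for illustration only.

```python
import math


def log_counts(counts):
    """Log-transform counts with a pseudocount of 1 (an assumed, common choice)."""
    return [math.log2(c + 1.0) for c in counts]


def pearson(x, y):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def remapping_rate(mapped_reads, total_reads):
    """Percentage of input reads that could be assigned to the reference."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return 100.0 * mapped_reads / total_reads


# Hypothetical per-transcript counts from two quantification tools (toy data,
# not taken from the study).
tool_a = [0.0, 12.0, 250.0, 1023.0, 7.0]
tool_b = [0.0, 15.0, 240.0, 1100.0, 5.0]

r = pearson(log_counts(tool_a), log_counts(tool_b))
print(f"correlation of log-transformed counts: {r:.3f}")
print(f"re-mapping rate: {remapping_rate(98, 100):.1f}%")
```

With real data, the two lists would hold the estimated counts for the same transcripts in the same order, for example after joining the per-transcript output tables of both tools on the transcript identifier.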

Conclusion

Since the performance of bioinformatic algorithms, especially of de novo assemblers, varies for different RNA-sequencing data sets, establishing an appropriate analysis workflow remains an important task. Our results show that the better performing combinations of algorithms produce congruent count data sets with consistent expression signal, highlighting the robustness of RNA-sequencing data analysis software.

Version published to 10.1101/2025.08.01.668104 on bioRxiv
Aug 3, 2025