Towards accurate, reference-free differential expression: A comprehensive evaluation of long-read de novo transcriptome assembly
Abstract
Long-read RNA sequencing has significantly advanced transcriptomics by enabling the full length of transcripts to be assessed. However, current analysis methods often depend on a high-quality reference genome and gene annotation. Recently, de novo assembly methods have been developed to utilise long-read data in cases where a reference genome is unavailable, such as in non-model organisms. Despite the potential of these tools, there remains a lack of benchmarking and established protocols for optimal reference-free, long-read transcriptome assembly and differential expression analysis.
Here, we comprehensively evaluate the current state-of-the-art long-read de novo transcriptome assembly tools, RATTLE, RNA-Bloom2 and isONform, and compare their performance to one of the leading short-read assemblers, Trinity. We assess various metrics, including assembly quality and computational efficiency, across a range of datasets. These include simulated data and spike-in sequin transcripts, where ground truth is known, and real data from human cell lines and pea (Pisum sativum) samples, for which the reference-guided assembler Bambu is used to define truth. To represent contemporary analysis scenarios, the datasets cover depths from 6 million to 60 million reads and both Oxford Nanopore Technologies (ONT) cDNA and direct RNA sequencing. Critically, we also assess the downstream impact of assembly choice on the detection of differential gene and transcript expression.
Our results confirm that long reads generate longer assembled transcripts than short reads for reference-free analysis, though limitations remain compared to reference-guided approaches, and suggest scope for improving accuracy and reducing redundancy. Of the de novo pipelines, RNA-Bloom2, coupled with Corset for transcript clustering, performed best in both accuracy and computational efficiency. Our findings offer guidance for selecting the most effective strategy for long-read differential expression analysis when a high-quality reference genome is unavailable.