Towards accurate, reference-free differential expression: A comprehensive evaluation of long-read de novo transcriptome assembly

Feng Yan
Pedro L. Baldoni
James Lancaster
Matthew E. Ritchie
Mathew G. Lewsey
Quentin Gouil
Nadia M. Davidson

Abstract

Long-read RNA sequencing has significantly advanced transcriptomics by enabling the full length of transcripts to be assessed. However, current analysis methods often depend on a high-quality reference genome and gene annotation. Recently, de novo assembly methods have been developed to utilise long-read data in cases where a reference genome is unavailable, such as in non-model organisms. Despite the potential of these tools, there remains a lack of benchmarking and established protocols for optimal reference-free, long-read transcriptome assembly and differential expression analysis.

Here, we comprehensively evaluate the current state-of-the-art long-read de novo transcriptome assembly tools, RATTLE, RNA-Bloom2 and isONform, and compare their performance to one of the leading short-read assemblers, Trinity. We assess various metrics, including assembly quality and computational efficiency, across a range of datasets: simulated data and spike-in sequin transcripts, where the ground truth is known, and real data from human and pea (Pisum sativum) samples, where a reference-based approach is used to define truth. To represent contemporary analysis scenarios, the datasets range in depth from 6 million to 60 million reads and span Oxford Nanopore Technologies (ONT) cDNA, ONT direct RNA and Pacific Biosciences (PacBio) 10x single-cell sequencing. Critically, we also assess the downstream impact of assembly choice on the detection of differential gene and transcript expression.

Our results confirm that, for reference-free analysis, long reads generate longer assembled transcripts than short reads, though limitations remain compared to reference-guided approaches, and they suggest scope for improved accuracy and reduced redundancy. Of the de novo pipelines, RNA-Bloom2, coupled with Corset for transcript clustering, performed best in terms of both accuracy and computational efficiency. Our findings offer guidance on selecting the most effective strategy for long-read differential expression analysis when a high-quality reference genome is unavailable.
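
As a rough illustration of how the length of assembled transcripts might be summarised, the sketch below computes per-transcript lengths from FASTA-formatted lines and their N50. This is a generic assembly summary chosen for illustration; the abstract does not state which quality metrics the authors used, and the function names `fasta_lengths` and `n50` are hypothetical.

```python
from typing import Dict, Iterable, List


def fasta_lengths(lines: Iterable[str]) -> Dict[str, int]:
    """Return {transcript_id: sequence_length} from FASTA-formatted lines."""
    lengths: Dict[str, int] = {}
    name = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first whitespace-delimited token of the header as the ID.
            header = line[1:].split()
            name = header[0] if header else f"unnamed_{len(lengths)}"
            lengths[name] = 0
        elif name is not None:
            # Sequence lines may be wrapped, so accumulate across lines.
            lengths[name] += len(line)
    return lengths


def n50(lengths: List[int]) -> int:
    """Length L such that transcripts of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # reached only for an empty input


if __name__ == "__main__":
    # Toy assembly (made-up sequences, for illustration only).
    toy = [">t1 example", "ACGTACGTAC", ">t2", "ACGTACGT", ">t3", "ACGT"]
    per_transcript = fasta_lengths(toy)
    print(per_transcript, n50(list(per_transcript.values())))
```

A summary like this says nothing about accuracy or redundancy on its own, so it would only complement comparisons against a known ground truth.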

Version published to 10.1101/2025.02.02.635999 on bioRxiv
Feb 7, 2025

Related articles

Shotgun metagenomics: a deep insight into the composition and function of the complex microbial world

This article has 7 authors:
1. Grazia Visci
2. Elisabetta Notario
3. Giuseppe Defazio
4. Mariano Francesco Caratozzolo
5. Bruno Fosso
6. Marinella Marzano
7. Graziano Pesole
This article has no evaluations
Latest version Jan 30, 2026

A Benchmarking Framework to Catalyze Individual Human Genome Projects

This article has 3 authors:
1. Manjushri kalpande
2. Apoorva Ganesh
3. Subhashini Srinivasan
This article has no evaluations
Latest version Dec 17, 2025

META-DIFF: a k-mer-based pipeline that detects differentially abundant sequences in metagenomics whole genome sequencing

This article has 8 authors:
1. Louis-Maël Guéguen
2. Alban Mathieu
3. Simon Pelletier
4. Anthony Woo
5. Namita Misra
6. Magali Moreau
7. Olivier Perin
8. Arnaud Droit
This article has no evaluations
Latest version Jan 29, 2026
