Systematic benchmarking of small variant calling pipelines for long-read RNA sequencing data

Jiayi Wang
Mark D. Robinson

Abstract

Background

Long-read RNA sequencing (lrRNA-seq) enables transcript-resolved variant detection, but systematic, neutral evaluations of small-variant calling pipelines remain limited. How existing tools perform across sequencing technologies, alignment strategies, variant callers, genomic contexts, and downstream haplotype phasing is not fully understood.

Results

Here, we systematically benchmark four lrRNA-seq variant callers (Clair3-RNA, DeepVariant, longcallR, and longcallR-nn), along with a widely used short-read RNA-seq variant caller (GATK HaplotypeCaller) as a baseline, using Genome in a Bottle (GIAB) datasets comprising three cell lines sequenced with four Oxford Nanopore Technologies (ONT) and two PacBio library preparation protocols. We further evaluate the impact of upstream alignment strategies, including aligner choice and alignment transformation, on variant-calling performance. Accuracy is assessed across sequencing depths and genomic contexts. Additionally, we compare haplotype phasing tools (WhatsHap, LongPhase, HapCUT2, HiPhase and longcallR) using variant calls generated by different callers to identify optimal pipeline combinations. Finally, we extend our evaluation of variant-calling performance to more recent LongBench datasets.

Conclusions

Our benchmark shows that sequencing quality is the primary determinant of lrRNA-seq variant-calling performance, followed by variant caller and alignment strategy, with additional effects from genomic context. In GIAB datasets, all lrRNA-seq-specific callers performed reasonably well, with Clair3-RNA (across both ONT and PacBio) and DeepVariant (for PacBio) ranking among the top-performing methods. In more recent LongBench datasets of cancer cell lines, DeepVariant and longcallR showed higher sensitivity, whereas Clair3-RNA and longcallR-nn were more conservative, yielding fewer variant calls. For downstream haplotype phasing, we recommend WhatsHap or HapCUT2 for most libraries, owing to their high phasing coverage and accuracy, respectively, while longcallR performs better on ONT dRNA004 datasets across both metrics.
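As an illustration of how variant-calling accuracy against a truth set such as GIAB can be summarised, the sketch below computes precision, recall, and F1 from a call set and a truth set. It is a minimal example and not the authors' pipeline: the abstract does not state which metrics or comparison tool were used, and the `Variant` type and `score_calls` function are hypothetical names introduced here. It also assumes exact matching on chromosome, position, reference allele, and alternate allele, whereas dedicated comparison tools additionally reconcile differing variant representations.

```python
from typing import Dict, Iterable, NamedTuple


class Variant(NamedTuple):
    """Hypothetical minimal variant record (illustration only)."""
    chrom: str
    pos: int
    ref: str
    alt: str


def score_calls(calls: Iterable[Variant], truth: Iterable[Variant]) -> Dict[str, float]:
    """Compare called variants with a truth set by exact match.

    Assumption: two variants are the same only if chrom, pos, ref and alt
    are all identical; no normalisation of representations is attempted.
    """
    call_set, truth_set = set(calls), set(truth)
    tp = len(call_set & truth_set)   # called and present in the truth set
    fp = len(call_set - truth_set)   # called but absent from the truth set
    fn = len(truth_set - call_set)   # in the truth set but not called

    # Guard against empty sets so the ratios are always defined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Toy data, invented for this example.
truth = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "GA"),
    Variant("chr2", 900, "T", "C"),
]
calls = [
    Variant("chr1", 100, "A", "G"),
    Variant("chr1", 250, "C", "T"),
    Variant("chr2", 75, "G", "GA"),
    Variant("chr3", 10, "A", "C"),   # not in the truth set
]
result = score_calls(calls, truth)
print(result)
```

A conservative caller, in the sense used above for Clair3-RNA and longcallR-nn on LongBench, would tend to show fewer false positives at the cost of more false negatives under this kind of scoring; a more sensitive caller would show the reverse.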

Version published to 10.64898/2026.04.29.721619 on bioRxiv
May 2, 2026

