The battle for reads: evaluating strategies to tackle multi-mapping in RNA-seq quantification in highly repetitive genomes

Aldana A Cepeda Dean
Virginia Balouz
Carlos A Buscaglia
Natalia Rego
Luisa Berná

Abstract

Background: RNA sequencing (RNA-seq) enables transcript quantification and isoform analysis in diverse biological contexts, but accurately measuring expression from highly related genomic regions remains challenging. Multi-mapped reads, those aligning equally well to multiple loci, pose a major computational hurdle and compromise the overall accuracy of transcriptome resolution.

Results: We evaluated five RNA-seq pipelines (Bowtie2 + featureCounts, STAR + featureCounts, STAR + Salmon, Salmon, and Kallisto) on their ability to quantify gene expression in Trypanosoma cruzi, a parasitic protozoan whose highly repetitive genome is characterized by abundant large multigene families. Using real RNA-seq data, we first compared gene-level outputs, with emphasis on multigene family representation. Simulated transcriptomes were then used to benchmark quantification accuracy under controlled conditions. Among the best-performing strategies (Salmon, Kallisto, and STAR + Salmon), we further tested whether including untranslated regions (UTRs) in gene annotations improved the assignment of ambiguous reads.

Conclusions: Overall, the alignment-free transcriptome quantifiers Salmon and Kallisto achieved the most accurate performance, closely matching simulated values. Incorporating UTR annotations improved read assignment accuracy, particularly for STAR + Salmon. These tools not only enable global expression quantification but also facilitate precise read allocation between members of the same gene family sharing up to 98% sequence identity. Our results highlight the critical role of annotation quality and quantification strategy in improving gene expression estimates.
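
To make the read-allocation problem concrete, the following is a minimal sketch, not the authors' pipeline, of how an expectation-maximization (EM) scheme can divide multi-mapped reads between two members of a gene family using the evidence from uniquely mapped reads. Quantifiers such as Salmon and Kallisto build on this general idea. The function name `em_allocate`, the toy transcripts `A` and `B`, and the read counts are all hypothetical, and the sketch deliberately ignores transcript length, fragment bias, and alignment scores.

```python
def em_allocate(compat, n_iter=200):
    """Toy EM read allocation (illustrative assumption, not a real tool's API).

    compat: list of tuples; each tuple holds the transcript IDs a read
            is compatible with (one ID = uniquely mapped read).
    Returns the expected read count per transcript.
    """
    transcripts = sorted({t for read in compat for t in read})
    # Start from equal relative abundances.
    abundance = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}

    for _ in range(n_iter):
        # E-step: split each read among its compatible transcripts
        # in proportion to the current abundance estimates.
        counts = {t: 0.0 for t in transcripts}
        for read in compat:
            total = sum(abundance[t] for t in read)
            for t in read:
                counts[t] += abundance[t] / total
        # M-step: re-estimate abundances from the expected counts.
        n_reads = len(compat)
        abundance = {t: counts[t] / n_reads for t in transcripts}

    return counts


# Hypothetical data: 30 reads unique to A, 10 unique to B,
# and 60 reads that map equally well to both.
reads = [("A",)] * 30 + [("B",)] * 10 + [("A", "B")] * 60
counts = em_allocate(reads)
print({t: round(c, 2) for t, c in counts.items()})  # {'A': 75.0, 'B': 25.0}
```

In this toy case the 3:1 ratio of unique reads drives the split of the 60 ambiguous reads (45 to A, 15 to B), whereas a count-based approach that discards multi-mapped reads would report only 30 and 10.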

Version published to 10.21203/rs.3.rs-7888056/v1 on Research Square
Nov 6, 2025

Related articles

Optimizing bioinformatic workflows to extract clinically usable gene expression data from targeted RNA sequencing panels: comparison with total RNAseq

This article has 12 authors:
1. Xiaokang Pan
2. Ashley Patton
3. Yi Seok Chang
4. Ryan Stevens
5. Nehad Mohamed
6. Matthew Hunt
7. Daniel Chappell
8. Yan Hu
9. Cecelia Miller
10. Weiqiang Zhao
11. Matthew Avenarius
12. Dan Jones
This article has no evaluations. Latest version: Feb 3, 2026

Large-scale reconstructions of Drosophila transcriptome identify ten thousands of new transcripts and transcription readthrough events

This article has 7 authors:
1. Haonan Duanmu
2. Meizhen Li
3. Zihan Zhou
4. Xinyan Li
5. Hao Chen
6. Kang He
7. Fei Li
This article has no evaluations. Latest version: Feb 20, 2026

META-DIFF: a k-mer-based pipeline that detects differentially abundant sequences in metagenomics whole genome sequencing

This article has 8 authors:
1. Louis-Maël Guéguen
2. Alban Mathieu
3. Simon Pelletier
4. Anthony Woo
5. Namita Misra
6. Magali Moreau
7. Olivier Perin
8. Arnaud Droit
This article has no evaluations. Latest version: Jan 29, 2026
