The battle for reads: evaluating strategies to tackle multi-mapping in RNA-seq quantification in highly repetitive genomes
Abstract
Background: RNA sequencing (RNA-seq) enables transcript quantification and isoform analysis in diverse biological contexts, but accurately measuring expression from highly related genomic regions remains challenging. Multi-mapped reads, those aligning equally well to multiple loci, pose a major computational hurdle and compromise the overall accuracy of transcriptome resolution.

Results: We evaluated five RNA-seq pipelines (Bowtie2 + featureCounts, STAR + featureCounts, STAR + Salmon, Salmon, and Kallisto) on their ability to quantify gene expression in Trypanosoma cruzi, a parasitic protozoan whose highly repetitive genome is characterized by an abundance of large multigene families. Using real RNA-seq data, we first compared gene-level outputs, with emphasis on multigene family representation. We then used simulated transcriptomes to benchmark quantification accuracy under controlled conditions. Among the best-performing strategies (Salmon, Kallisto, and STAR + Salmon), we further tested whether including untranslated regions (UTRs) in gene annotations improved the assignment of ambiguous reads.

Conclusions: Overall, the alignment-free transcriptome quantifiers Salmon and Kallisto were the most accurate, closely matching simulated values. Incorporating UTR annotations improved read assignment accuracy, particularly for STAR + Salmon. These tools not only enable global expression quantification but also facilitate precise read allocation between members of the same gene family, even those sharing up to 98% sequence identity. Our results highlight the critical role of annotation quality and quantification strategy in improving gene expression estimates.
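
As an illustration of what benchmarking against simulated values can involve, the sketch below compares estimated per-gene counts with known simulated counts. It is a minimal example under stated assumptions, not the authors' procedure: the abstract does not name the accuracy metrics used, so Spearman correlation and the mean absolute relative difference (MARD) are assumed here as two commonly used choices, and the gene IDs and counts are hypothetical.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def mard(truth, estimate):
    """Mean absolute relative difference; 0 means perfect agreement."""
    diffs = []
    for t, e in zip(truth, estimate):
        total = t + e
        # A gene absent from both truth and estimate contributes no error.
        diffs.append(0.0 if total == 0 else abs(t - e) / total)
    return sum(diffs) / len(diffs)


# Hypothetical per-gene read counts (illustrative only, not from the study).
simulated = {"geneA": 100, "geneB": 50, "geneC": 0, "geneD": 200}
estimated = {"geneA": 90, "geneB": 60, "geneC": 0, "geneD": 200}

genes = sorted(simulated)
truth = [simulated[g] for g in genes]
est = [estimated[g] for g in genes]

rho = spearman(truth, est)
error = mard(truth, est)
print(f"Spearman rho = {rho:.3f}, MARD = {error:.4f}")
```

The two metrics answer different questions: a rank correlation shows whether a pipeline preserves the ordering of genes by expression, while MARD penalizes the size of each per-gene deviation. The second matters when reads are misallocated between near-identical members of a multigene family, since the ordering can survive while individual counts shift.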