An evaluation of clustering and assembly strategies from Iso-Seq data in the absence of reference genomes in non-model animals

Abstract

Transcriptome assembly enables the recovery of expressed genes and isoforms, but the optimal strategy for reconstructing transcriptomes from long-read sequencing remains unresolved. In particular, establishing best practices for generating accurate gene models and selecting representative isoforms is essential for comparative genomics, because orthology inference typically includes only the longest isoform per gene model. Here, we systematically compare clustering and de novo assembly methods using PacBio Iso-Seq data from 17 animal lineages spanning seven phyla, most of them non-model species. Our goal is to determine which methodology is best suited to selecting one isoform per gene model, given that no dedicated pipelines exist for this task. We evaluate four approaches: isoseq cluster, CD-HIT, RNA-Bloom2 and isONform. We benchmark them against short-read assemblies generated with Trinity, assessing assembly quality through BUSCO completeness, short-read mapping rates, coding sequence recovery, and longest isoform prediction. Our results show that CD-HIT clustering at high similarity thresholds (≥99%) yields the most complete and coding-rich long-read transcriptomes, rivaling Trinity while avoiding its high redundancy. Consensus-based methods such as isoseq cluster and isONform recover fewer single-copy orthologs (reflected in lower BUSCO scores) and achieve lower mapping rates, while RNA-Bloom2 provides intermediate performance with reduced duplication. Together, these findings establish CD-HIT as, to date, a robust and practical strategy for transcriptome reconstruction from long-read data when genomic references are unavailable. By benchmarking de novo methods across a taxonomically broad dataset, this work defines the realistic capabilities of long-read transcriptome reconstruction in the absence of a reference genome and provides practical guidance for deriving high-quality gene models and selecting representative isoforms for orthology inference in non-model species.
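
The idea of keeping one longest isoform per cluster can be illustrated with a minimal sketch. It is not the authors' pipeline: it assumes the cluster listing that CD-HIT writes alongside its output (a `.clstr` file with `>Cluster N` headers followed by member lines giving a length and a sequence identifier), and it treats each cluster as a proxy for one gene model, which is itself an assumption. The function name `longest_per_cluster` and the example identifiers are hypothetical.

```python
import re
from typing import Dict, Iterable, Tuple

# Assumed layout of a member line in a CD-HIT .clstr file, e.g.:
#   0<TAB>1500nt, >transcript_a... *
#   1<TAB>1480nt, >transcript_b... at +/99.32%
MEMBER_RE = re.compile(r"^\d+\s+(\d+)(?:nt|aa), >(.+?)\.\.\.")


def longest_per_cluster(lines: Iterable[str]) -> Dict[str, Tuple[str, int]]:
    """Map each cluster name to (sequence id, length) of its longest member."""
    best: Dict[str, Tuple[str, int]] = {}
    cluster = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">Cluster"):
            cluster = line[1:]  # drop the leading ">"
            continue
        match = MEMBER_RE.match(line)
        if match is None or cluster is None:
            continue  # skip lines that do not fit the assumed layout
        length, seq_id = int(match.group(1)), match.group(2)
        # Keep the first member seen when lengths tie.
        if cluster not in best or length > best[cluster][1]:
            best[cluster] = (seq_id, length)
    return best


# Hypothetical cluster listing, for illustration only.
example = [
    ">Cluster 0",
    "0\t1480nt, >transcript_b... at +/99.32%",
    "1\t1500nt, >transcript_a... *",
    ">Cluster 1",
    "0\t900nt, >transcript_c... *",
]
representatives = longest_per_cluster(example)
```

On a real run the lines would come from the cluster file itself (for example by iterating over an open file handle), and the selected identifiers could then be used to subset the transcript FASTA before orthology inference.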
