IRCAS: a novel end-to-end approach to identify, rectify and classify comprehensive alternative splicing events in a transcriptome without genome reference

Abstract

Alternative splicing (AS) is a fundamental post-transcriptional mechanism that amplifies proteomic diversity and enables adaptive responses across eukaryotes. Current AS detection methods rely heavily on reference genomes, limiting their applicability to non-model organisms. Existing reference-free approaches suffer from inaccurate splice site prediction and treat detection and classification as separate processes, resulting in cascading errors. We present IRCAS, an integrated end-to-end framework for reference-free AS analysis, comprising three modules: identification, rectification, and classification. IRCAS employs colored de Bruijn graphs for AS detection, an attention-based CNN for splice site rectification, and a hybrid Graph Neural Network combining GAT and Transformer layers for classification. Evaluation across four species demonstrates substantial improvements: splice site accuracy increased to 92-96% versus 50-55% for existing methods, and end-to-end accuracy reached 83.4% compared to 41.2% for the previous best method. IRCAS establishes a new benchmark for reference-free AS detection in non-model organisms.
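To make the colored de Bruijn graph idea concrete, here is a minimal sketch, not the IRCAS implementation. It assumes two transcript isoforms, labels them with the hypothetical colors "A" and "B", and records which isoform(s) each k-mer edge comes from. Edges carried by only one color mark where the two sequences diverge, which is the kind of bubble an AS event would produce. The function names and the toy sequences are illustrative assumptions.

```python
from collections import defaultdict


def build_cdbg(seq_a: str, seq_b: str, k: int) -> dict:
    """Build a toy colored de Bruijn graph from two sequences.

    Each k-mer becomes an edge between its (k-1)-mer prefix and suffix.
    The value stored per edge is the set of colors ("A", "B") that contain it.
    """
    edges = defaultdict(set)
    for color, seq in (("A", seq_a), ("B", seq_b)):
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            edges[(kmer[:-1], kmer[1:])].add(color)
    return dict(edges)


def color_counts(edges: dict) -> tuple:
    """Return (shared, only_a, only_b) edge counts."""
    shared = sum(1 for c in edges.values() if c == {"A", "B"})
    only_a = sum(1 for c in edges.values() if c == {"A"})
    only_b = sum(1 for c in edges.values() if c == {"B"})
    return shared, only_a, only_b


# Toy exon-skipping pair: the long isoform keeps a 6-nt exon, the short one skips it.
prefix, exon, suffix = "ATGGCGTACC", "TTGACA", "GGATCCGTAA"
graph = build_cdbg(prefix + exon + suffix, prefix + suffix, k=4)
print(color_counts(graph))
```

In this toy pair the short isoform contributes k-1 edges of its own (the k-mers spanning the exon junction), while the long isoform contributes the exon length plus k-1. Real transcripts contain repeats and sequencing variation, so bubble detection in practice needs more than edge counting.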

GRAPHICAL ABSTRACT

Fig 1.

Workflow for the construction and application of IRCAS. IRCAS comprises three modules: identification, rectification, and classification. (A) Workflow for reference-free AS identification from raw transcriptomic data. First, the input transcripts are screened with an all-versus-all BLAST alignment. The MkcDBGAS graph construction strategy is then applied: a cDBG is constructed from two sequences using a specified k-mer size. Based on bubble topology, bubbles are classified into five categories: SNV-induced, four-AS-induced (the four AS types classified in C), MX-induced, AL-induced, and AF-induced. (B) Workflow for AS position offset rectification and cDBG reconstruction. Each input transcript pair is converted into a single sequence that includes two virtual nucleotides marking the splicing start and end sites. SUPPA, a reference-based method, is used to determine the true splice sites. The sequence is one-hot encoded into an n×6 matrix. The offset between the true and predicted splice sites is calculated and encoded as the ground truth. An attention-based convolutional neural network (CNN) rectification model is trained on these data to predict the offset, enabling reconstruction of the cDBG with corrected splicing positions. (C) Workflow for classification of the four AS types. For each cDBG, node features, edge features, and global features are extracted. These features are integrated into distinct layers of a graph attention network (GAT)-Transformer hybrid model. This architecture enables high-precision classification of four types of AS events. (D) Workflow for end-to-end application of IRCAS. Transcriptomic data from any species lacking a reference genome are processed by IRCAS, enabling high-accuracy classification of seven AS types.
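The encoding step in panel (B) can be illustrated with a short sketch. It rests on assumptions the caption does not confirm: that the six channels are the four nucleotides plus the two virtual nucleotides, that the virtual nucleotides can be written as the hypothetical symbols "S" and "E", and that the offset label is the signed difference between the true and predicted positions. The function names are illustrative, not taken from IRCAS.

```python
# Assumed channel order: four nucleotides plus two hypothetical virtual symbols.
ALPHABET = "ACGTSE"


def insert_markers(seq: str, start: int, end: int) -> str:
    """Insert the virtual nucleotides S and E before 0-based positions start and end."""
    if not 0 <= start <= end <= len(seq):
        raise ValueError("expected 0 <= start <= end <= len(seq)")
    return seq[:start] + "S" + seq[start:end] + "E" + seq[end:]


def one_hot(seq: str) -> list:
    """Encode a marked sequence as an n x 6 list of 0/1 rows."""
    index = {ch: i for i, ch in enumerate(ALPHABET)}
    rows = []
    for ch in seq:
        row = [0] * len(ALPHABET)
        row[index[ch]] = 1
        rows.append(row)
    return rows


def offset_label(predicted: int, true: int) -> int:
    """Signed offset a rectification model would be trained to predict (assumed form)."""
    return true - predicted


marked = insert_markers("ACGTACGT", start=2, end=5)
matrix = one_hot(marked)
print(marked, len(matrix), len(matrix[0]))
print(offset_label(predicted=2, true=4))
```

Each row of the matrix has exactly one nonzero entry, so the marker positions stay visible to a convolutional model as their own channels rather than being folded into the nucleotide channels.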
