A Transformer-Based Ensemble Approach For DNA Splice Junction Classification


Abstract

Accurate prediction of DNA splice junctions is a fundamental problem in computational biology and is important for understanding gene structure and function. The core difficulty is identifying the junction type, exon-intron (EI), intron-exon (IE), or neither (N), from a given DNA sequence. Junction prediction is critical for determining gene expression patterns, splicing regulation, disease etiology, and genome architecture. Localizing the boundaries for intron removal and exon inclusion is difficult because no single rule governs RNA splicing. To address this problem, the proposed work employs an ensemble learning approach in the form of a two-layer hybrid model. In the first layer, features are extracted by two transformers, DNABERT and DistilBERT. DNABERT, a transformer model tailored to DNA sequences, is pre-trained on genomic data with a masked language modeling objective, which helps it capture both long- and short-range dependencies and thereby improves feature extraction. DistilBERT's lighter architecture enables efficient feature extraction while maintaining competitive performance. The second layer is a stacking ensemble classifier trained on the features produced by the first layer. The splice junction dataset is split into training and testing sets using cross-validation, allowing a thorough assessment of the classifier's generalization capacity, and the ensemble is trained on the resulting splits. The approach was evaluated on the UCI splice junction dataset. Compared with other machine learning models for DNA splice junction classification, the proposed DNABERT + stacking ensemble achieved an accuracy of 95%, and the DistilBERT + stacking ensemble achieved an accuracy of 91%.
These findings indicate that DNABERT is better suited to tasks requiring detailed contextual understanding, while DistilBERT offers significant computational efficiency, making it a better choice for applications with resource constraints.
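The two-layer design described above (transformer embeddings in layer one, a stacking ensemble in layer two) can be sketched with stand-in components. This is a minimal pure-Python illustration, not the paper's implementation: synthetic 2-D "embeddings" replace the DNABERT/DistilBERT features, nearest-centroid models replace the unnamed base and meta classifiers, and a single held-out blend split approximates the cross-validated training the abstract describes. All names and data here are hypothetical.

```python
import random

random.seed(0)

# Hypothetical stand-in data: in the paper, layer one produces DNABERT /
# DistilBERT embeddings of DNA sequences. Here we fabricate 2-D "features"
# for the three junction classes (EI, IE, N) so the sketch runs without
# any pretrained-model downloads.
CENTERS = {"EI": (0.0, 0.0), "IE": (4.0, 0.0), "N": (2.0, 4.0)}

def make_data(n_per_class=40):
    data = []
    for label, (cx, cy) in CENTERS.items():
        for _ in range(n_per_class):
            data.append(((cx + random.gauss(0, 1), cy + random.gauss(0, 1)), label))
    random.shuffle(data)
    return data

class NearestCentroid:
    """Minimal learner: predict the class whose centroid is closest."""
    def fit(self, X, y):
        sums, counts = {}, {}
        for x, label in zip(X, y):
            s = sums.setdefault(label, [0.0] * len(x))
            for i, v in enumerate(x):
                s[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {l: [v / counts[l] for v in s] for l, s in sums.items()}
        return self

    def predict(self, X):
        def d2(a, b):
            return sum((u - v) ** 2 for u, v in zip(a, b))
        return [min(self.centroids, key=lambda l: d2(x, self.centroids[l]))
                for x in X]

class SingleDimLearner:
    """Base learner restricted to one feature dimension, so the two base
    models stay deliberately different (ensemble diversity)."""
    def __init__(self, dim):
        self.dim = dim
        self.model = NearestCentroid()
    def fit(self, X, y):
        self.model.fit([(x[self.dim],) for x in X], y)
        return self
    def predict(self, X):
        return self.model.predict([(x[self.dim],) for x in X])

def stack_features(bases, X):
    """Encode each base model's prediction as a class index: one meta-feature
    per base learner, the input to the second-layer (meta) classifier."""
    classes = sorted(CENTERS)
    preds = [b.predict(X) for b in bases]
    return [[classes.index(p[i]) for p in preds] for i in range(len(X))]

# Base learners train on one split; the meta-learner trains on held-out base
# predictions (a simple blend; full stacking with cross-validated out-of-fold
# predictions follows the same pattern).
data = make_data()
train, blend, test = data[:60], data[60:90], data[90:]
Xtr, ytr = zip(*train)
Xbl, ybl = zip(*blend)
Xte, yte = zip(*test)

bases = [SingleDimLearner(0).fit(Xtr, ytr), SingleDimLearner(1).fit(Xtr, ytr)]
meta = NearestCentroid().fit(stack_features(bases, Xbl), ybl)

preds = meta.predict(stack_features(bases, Xte))
accuracy = sum(p == t for p, t in zip(preds, yte)) / len(yte)
print(f"stacked accuracy on toy data: {accuracy:.2f}")
```

The same blend-then-stack structure applies when the fabricated features are replaced by real transformer embeddings and the nearest-centroid models by stronger classifiers.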
