Benchmarking Genomic Foundation Models for Gene Fusion Detection from DNA Sequences
Abstract
Background
Gene fusions are critical drivers of oncogenesis and serve as diagnostic biomarkers in various cancers. However, detecting them from RNA or DNA sequencing data with traditional analytical methods is hampered by sample quality, computational complexity, and noise. Deep learning approaches are more robust, but they usually require large labeled datasets and substantial training resources. Genomic foundation models (GFMs), which are pre-trained on pangenome-scale data, offer a promising solution to these issues.

Methods
This study presents the first comprehensive benchmark of four transformer-based GFMs, Nucleotide Transformer (NT), Evo2, HyenaDNA, and DNABERT2, for gene fusion detection. Using the curated FusionAI dataset of ~52,000 sequences, we extracted embeddings from 10-kilobase-pair (kbp) DNA sequences surrounding fusion breakpoints. We evaluated the quality of these representations qualitatively using t-SNE visualization and quantitatively by training lightweight classifiers (Support Vector Machines and simple Neural Networks) on the fixed embeddings.

Results
NT achieved the best performance, with an accuracy of 0.967 and an F1 score of 0.967, outperforming the dedicated deep learning baseline (FusionAI; accuracy 0.894). Evo2 was the second-best performer (accuracy 0.920), demonstrating robustness derived from evolutionary pre-training. Conversely, DNABERT2 was not competitive (accuracy 0.677–0.723). Furthermore, a sample-efficiency analysis showed that NT required only ~2,600 samples to reach 95% of its peak performance, whereas the baseline required over 14,000 samples.

Conclusions
These findings demonstrate that advanced GFMs, particularly NT and Evo2, generate highly discriminative 'out-of-the-box' embeddings. These embeddings significantly outperform dedicated deep learning baselines while requiring a fraction of the training data and computational time. This suggests that GFMs could offer a scalable, data-efficient route to developing precise genomic diagnostic tools, particularly for rare diseases.
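
The sample-efficiency figures in the abstract (samples needed to reach 95% of peak performance) can be illustrated with a small calculation. The sketch below shows one plausible reading of that metric: given a learning curve of (training-set size, accuracy) points, it returns the smallest size whose accuracy is at least 95% of the best accuracy on the curve. The function name `samples_to_reach_fraction` and the numbers in `toy_curve` are hypothetical and chosen only for illustration. They are not taken from the study, and the authors' exact procedure may differ.

```python
from typing import Sequence, Tuple


def samples_to_reach_fraction(
    curve: Sequence[Tuple[int, float]], fraction: float = 0.95
) -> int:
    """Return the smallest training-set size whose score is at least
    `fraction` of the best score observed on the learning curve."""
    if not curve:
        raise ValueError("curve must contain at least one point")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")

    points = sorted(curve)  # order by training-set size
    peak = max(score for _, score in points)
    threshold = fraction * peak

    for n_samples, score in points:
        if score >= threshold:
            return n_samples

    # Not reachable for non-negative scores: the peak itself meets the threshold.
    raise RuntimeError("no point reaches the threshold")


# Hypothetical learning curve (NOT data from the study): (n_train, accuracy).
toy_curve = [(500, 0.70), (1000, 0.82), (2500, 0.92), (5000, 0.95), (10000, 0.96)]

# Peak is 0.96, so the threshold is 0.95 * 0.96 = 0.912.
print(samples_to_reach_fraction(toy_curve))  # prints 2500
```

Applying the same function to the learning curves of two models gives a direct comparison of how much labeled data each needs, which is the kind of contrast the abstract draws between NT and the FusionAI baseline.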