DNALongBench: A Benchmark Suite for Long-Range DNA Prediction Tasks
Abstract
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is notably absent. To address this gap, we introduce DNALongBench, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signals. To comprehensively assess DNALongBench, we evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models: HyenaDNA, Caduceus-Ph, and Caduceus-PS. We envision DNALongBench as a standardized resource with the potential to facilitate comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that account for long-range dependencies.