Explicit Dynamic Cross-Strand Interactions for DNA Sequence Language Modeling
Abstract
Large language models are driving a transformation in the functional interpretation of genomic sequences. Methodologically, existing DNA sequence language models fall broadly into two types: one treats a DNA sequence as unidirectional text and models a single strand; the other achieves reverse-complement symmetry through data augmentation or model equivariance, amounting to static double-strand modeling. Both approaches predominantly approximate double-strand interactions in an implicit, static manner and struggle to capture context-driven cross-strand information exchange during sequence representation learning. In reality, double-strand information exchange is not an isolated event: it is governed by continuous physical coupling, functional synergy, and information transfer, a mechanism fundamental to genomic function. Motivated by this, we propose CrossDNA, an explicit and dynamic language model for DNA cross-strand modeling. Specifically, CrossDNA employs a dual-branch architecture with rotating double-strand inputs to simulate the continuous information flow in the DNA double helix, establishes inter-strand communication via a lightweight TokenBridge module, and incorporates Comba with sliding-window attention (SWA) to capture long-range dependencies, while maintaining reverse-complement equivariance and stabilizing single-strand contextual semantics through self-distillation and consistency constraints from a branch teacher model. On classification, regression, and representation tasks, CrossDNA achieves consistent performance improvements and significantly enhances robustness to sequence orientation, particularly in enhancer prediction, where it more readily identifies features with clear biological significance. Across the benchmarks we evaluated, CrossDNA, with only a few million parameters, matches or surpasses models with hundreds of millions of parameters, substantially reducing training and inference costs and demonstrating high parameter efficiency and usability. Overall, CrossDNA advances DNA representation learning from implicit, static approximation to explicit, dynamic, and systematic modeling, pointing the way toward a new generation of DNA language models and laying a foundation for deeper analysis of genomic structure and function.
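To make the cross-strand idea concrete, the following is a minimal PyTorch sketch of a dual-branch encoder with a lightweight inter-strand exchange module and a simple orientation-consistency term. It is an illustration under stated assumptions, not the paper's implementation: the token encoding, the gated TokenBridge stand-in, the standard transformer layer with a local attention mask (standing in for the Comba + SWA backbone), and the MSE consistency loss (standing in for the self-distillation and consistency constraints) are all hypothetical placeholders.

```python
# Minimal sketch of the cross-strand modeling idea described in the abstract.
# Module names, dimensions, and the backbone are illustrative assumptions,
# not the CrossDNA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed base encoding: A=0, C=1, G=2, T=3 (so A<->T, C<->G complement).
COMPLEMENT = {0: 3, 1: 2, 2: 1, 3: 0}

def reverse_complement(tokens: torch.Tensor) -> torch.Tensor:
    """Map each base to its complement, then reverse the sequence."""
    comp = tokens.clone()
    for base, pair in COMPLEMENT.items():
        comp[tokens == base] = pair
    return comp.flip(dims=[-1])

class TokenBridge(nn.Module):
    """Hypothetical lightweight inter-strand exchange: a gated mix of the two
    branches' token states (a stand-in for the paper's TokenBridge)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_fwd, h_rev):
        # Flip the reverse branch so position i faces position i of the forward branch.
        h_rev_aligned = h_rev.flip(dims=[1])
        g = torch.sigmoid(self.gate(torch.cat([h_fwd, h_rev_aligned], dim=-1)))
        mixed_fwd = g * h_fwd + (1 - g) * h_rev_aligned
        mixed_rev = (g * h_rev_aligned + (1 - g) * h_fwd).flip(dims=[1])
        return mixed_fwd, mixed_rev

class DualStrandEncoder(nn.Module):
    """Dual-branch encoder: both strands pass through shared context layers,
    with a TokenBridge exchanging information between strands after each layer."""
    def __init__(self, vocab=4, dim=64, layers=2, heads=4, window=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
             for _ in range(layers)]
        )
        self.bridges = nn.ModuleList([TokenBridge(dim) for _ in range(layers)])
        self.window = window  # half-width of the local attention window

    def _local_mask(self, n, device):
        # Boolean mask (True = blocked) restricting attention to a +/- window,
        # a simple stand-in for sliding-window attention.
        idx = torch.arange(n, device=device)
        return (idx[None, :] - idx[:, None]).abs() > self.window

    def forward(self, tokens):
        rc = reverse_complement(tokens)
        h_fwd, h_rev = self.embed(tokens), self.embed(rc)
        mask = self._local_mask(tokens.size(1), tokens.device)
        for block, bridge in zip(self.blocks, self.bridges):
            h_fwd = block(h_fwd, src_mask=mask)
            h_rev = block(h_rev, src_mask=mask)
            h_fwd, h_rev = bridge(h_fwd, h_rev)
        return h_fwd, h_rev

# Usage: an orientation-consistency term nudging the two branches toward
# orientation-robust representations (simplified stand-in for the paper's
# self-distillation and consistency constraints).
model = DualStrandEncoder()
x = torch.randint(0, 4, (2, 128))               # batch of 2 sequences, length 128
h_fwd, h_rev = model(x)
consistency = F.mse_loss(h_fwd, h_rev.flip(dims=[1]))
print(h_fwd.shape, consistency.item())
```

The point the sketch tries to capture is the one the abstract emphasizes: cross-strand information exchange happens inside the encoder at every layer, rather than being approximated afterwards by augmenting data or averaging predictions over the two orientations.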