Explicit Dynamic Cross-Strand Interactions for DNA Sequence Language Modeling
Abstract
Large language models are driving a transformation in the functional interpretation of genomic sequences. Methodologically, existing DNA sequence language models fall broadly into two types: one treats a DNA sequence as unidirectional text and models a single strand; the other achieves reverse-complement symmetry through data augmentation or model equivariance, amounting to static double-strand modeling. Both approaches predominantly approximate double-strand interactions in an implicit, static manner and struggle to capture context-driven cross-strand information exchange during sequence representation learning. In reality, double-strand information exchange is not an isolated event: it is governed by continuous physical coupling, functional synergy, and information transfer, a mechanism fundamental to genomic function. Motivated by this, we propose CrossDNA, an explicit and dynamic language model for DNA cross-strand modeling. Specifically, CrossDNA employs a dual-branch architecture with rotating double-strand inputs to simulate the continuous information flow in the DNA double helix, establishes inter-strand communication via a lightweight TokenBridge module, and incorporates Comba with sliding-window attention (SWA) to capture long-range dependencies, while maintaining reverse-complement equivariance and stabilizing single-strand contextual semantics through self-distillation and consistency constraints from a branch teacher model. On classification, regression, and representation tasks, CrossDNA achieves consistent performance improvements and significantly enhances robustness to sequence orientation, particularly in enhancer prediction, where it more readily identifies features with clear biological significance. Across the benchmarks we evaluated, CrossDNA, with only a few million parameters, matches or surpasses models with hundreds of millions of parameters, substantially reducing training and inference costs and demonstrating high parameter efficiency and usability. Overall, CrossDNA advances DNA representation learning from implicit, static approximation to explicit, dynamic, and systematic modeling, pointing the way toward a new generation of DNA language models and laying a foundation for deeper analysis of genomic structure and function.
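To make the cross-strand idea concrete, the following is a minimal PyTorch sketch of a dual-branch encoder with a lightweight inter-strand exchange module and a simple orientation-consistency term. It is an illustration under stated assumptions, not the paper's implementation: the token encoding, the gated TokenBridge stand-in, the standard transformer layer with a local attention mask (standing in for the Comba + SWA backbone), and the MSE consistency loss (standing in for the self-distillation and consistency constraints) are all hypothetical placeholders.

```python
# Minimal sketch of the cross-strand modeling idea described in the abstract.
# Module names, dimensions, and the backbone are illustrative assumptions,
# not the CrossDNA implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed base encoding: A=0, C=1, G=2, T=3 (so A<->T, C<->G complement).
COMPLEMENT = {0: 3, 1: 2, 2: 1, 3: 0}

def reverse_complement(tokens: torch.Tensor) -> torch.Tensor:
    """Map each base to its complement, then reverse the sequence."""
    comp = tokens.clone()
    for base, pair in COMPLEMENT.items():
        comp[tokens == base] = pair
    return comp.flip(dims=[-1])

class TokenBridge(nn.Module):
    """Hypothetical lightweight inter-strand exchange: a gated mix of the two
    branches' token states (a stand-in for the paper's TokenBridge)."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)

    def forward(self, h_fwd, h_rev):
        # Flip the reverse branch so position i faces position i of the forward branch.
        h_rev_aligned = h_rev.flip(dims=[1])
        g = torch.sigmoid(self.gate(torch.cat([h_fwd, h_rev_aligned], dim=-1)))
        mixed_fwd = g * h_fwd + (1 - g) * h_rev_aligned
        mixed_rev = (g * h_rev_aligned + (1 - g) * h_fwd).flip(dims=[1])
        return mixed_fwd, mixed_rev

class DualStrandEncoder(nn.Module):
    """Dual-branch encoder: both strands pass through shared context layers,
    with a TokenBridge exchanging information between strands after each layer."""
    def __init__(self, vocab=4, dim=64, layers=2, heads=4, window=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(dim, heads, dim * 4, batch_first=True)
             for _ in range(layers)]
        )
        self.bridges = nn.ModuleList([TokenBridge(dim) for _ in range(layers)])
        self.window = window  # half-width of the local attention window

    def _local_mask(self, n, device):
        # Boolean mask (True = blocked) restricting attention to a +/- window,
        # a simple stand-in for sliding-window attention.
        idx = torch.arange(n, device=device)
        return (idx[None, :] - idx[:, None]).abs() > self.window

    def forward(self, tokens):
        rc = reverse_complement(tokens)
        h_fwd, h_rev = self.embed(tokens), self.embed(rc)
        mask = self._local_mask(tokens.size(1), tokens.device)
        for block, bridge in zip(self.blocks, self.bridges):
            h_fwd = block(h_fwd, src_mask=mask)
            h_rev = block(h_rev, src_mask=mask)
            h_fwd, h_rev = bridge(h_fwd, h_rev)
        return h_fwd, h_rev

# Usage: an orientation-consistency term nudging the two branches toward
# orientation-robust representations (simplified stand-in for the paper's
# self-distillation and consistency constraints).
model = DualStrandEncoder()
x = torch.randint(0, 4, (2, 128))               # batch of 2 sequences, length 128
h_fwd, h_rev = model(x)
consistency = F.mse_loss(h_fwd, h_rev.flip(dims=[1]))
print(h_fwd.shape, consistency.item())
```

The point the sketch tries to capture is the one the abstract emphasizes: cross-strand information exchange happens inside the encoder at every layer, rather than being approximated afterwards by augmenting data or averaging predictions over the two orientations.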