From nucleotides to semantics: genomic representation learning via joint-embedding predictive architecture

Chengsen Wang
Qi Qi
Haifeng Sun
Zirui Zhuang
Bo He
Siying Liu
Jianxin Liao
Jingyu Wang

Abstract

Decoding the regulatory syntax encoded in genomic sequences is a central objective in computational biology. Most existing genomic foundation models treat DNA as a language and adopt pretraining objectives from natural language processing. DNA sequences, however, lack explicit semantic boundaries and contain substantial evolutionary noise. Nucleotide-level reconstruction in a low-dimensional input space can therefore increase computational overhead and may yield representations with limited discriminative capacity. Downstream tasks also often depend on expensive finetuning, which restricts practical use in many biology laboratories. Here we present GenoJEPA, a genomic representation learning framework based on a joint-embedding predictive architecture. GenoJEPA combines continuous patching with semantic alignment, shifting the optimization target from local base reconstruction to alignment in latent space. Across 55 downstream tasks, GenoJEPA shows strong representational capacity and robust generalization while reducing parameter count and computational cost. Semantic vectors extracted from a frozen GenoJEPA allow lightweight, GPU-free classifiers to reach competitive accuracy. These results suggest a practical route towards efficient training and broad application of larger-scale genomic foundation models.

Version published to 10.64898/2026.04.02.716255 on bioRxiv, Apr 6, 2026

Related articles

Hidden State Genomics: Graph-Based Analysis of Sparse Auto-Encoder Feature Activity in Genomic Language Models

This article has 3 authors:
1. Eliot Kmiec
2. Samuel O’Brien
3. Matthew McCoy
This article has no evaluations. Latest version May 16, 2026

Bio-BLIP: A Multimodal Architecture for Transferable Reasoning in Genomic Variant Interpretation

This article has 4 authors:
1. Anvita Gupta
2. Alejandro Buendia
3. Anshul Kundaje
4. Jure Leskovec
This article has no evaluations. Latest version May 15, 2026

GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations

This article has 6 authors:
1. Yi Shen
2. Guangshuo Cao
3. Jianghong Wu
4. Dijun Chen
5. Cong Feng
6. Ming Chen
This article has no evaluations. Latest version Apr 24, 2026
