GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations

Yi Shen
Guangshuo Cao
Jianghong Wu
Dijun Chen
Cong Feng
Ming Chen

Abstract

Deciphering the mapping between linear biomolecular sequences and complex biological functions remains a central challenge in genomics. Although existing generative nucleotide language models have made substantial progress in modeling sequence distributions, they generally lack explicit access to high-level biological semantics, limiting their ability to support semantics-guided conditional generation. To address this limitation, we present GenNA, a generative nucleotide foundation model guided by natural-language annotations. GenNA is pretrained on a multimodal nucleotide-text corpus spanning 2,221 eukaryotic species and comprising approximately 416 billion characters, and learns the relationships between sequence patterns and functional annotations within a unified autoregressive framework. Systematic evaluations show that, even without explicit supervision from biological rules, GenNA yields distinguishable perplexity scores in response to semantic mismatches between sequences and functional annotations, to different mutation types, and to perturbations of species labels. Moreover, across a range of natural-language-guided nucleotide generation tasks, the model produces sequences consistent with both target semantics and species context. Overall, GenNA provides a unified framework for natural-language-guided nucleotide modeling and conditional generation, and offers a feasible route toward integrating high-level functional descriptions with low-level sequence design.

Version published to 10.64898/2026.04.22.720063 on bioRxiv
Apr 24, 2026

Related articles

A trainable language model for modulating translation rates in non-model organisms by generating upstream untranslated region sequence libraries

This article has 3 authors:
1. Alexander D. Duggan
2. Matthew P. Newman
3. David R. McMillen
This article has no evaluations. Latest version: Apr 20, 2026

LAMBDA: A Prophage Detection Benchmark for Genomic Language Models

This article has 12 authors:
1. LeAnn M. Lindsey
2. Nicole L. Pershing
3. Keith Dufault-Thompson
4. Ho-jin Gwak
5. Anisa Habib
6. Aaron Schindler
7. Arjun Rakheja
8. June Round
9. W. Zac Stephens
10. Anne J. Blaschke
11. Hari Sundar
12. Xiaofang Jiang
This article has no evaluations. Latest version: Mar 26, 2026

EvoRMD: Integrating Biological Context and Evolutionary RNA Language Models for Interpretable Prediction of RNA Modifications

This article has 6 authors:
1. Bo Wang
2. Hao Zhang
3. Taoyong Cui
4. Xiaoyu Wang
5. Jiangning Song
6. Hao Xu
This article has no evaluations. Latest version: Mar 25, 2026
