GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations


Abstract

Deciphering the mapping between linear biomolecular sequences and complex biological functions remains a central challenge in genomics. Although existing generative nucleotide language models have made substantial progress in modeling sequence distributions, they generally lack explicit access to high-level biological semantics, limiting their ability to support semantics-guided conditional generation. To address this limitation, we present GenNA, a generative nucleotide foundation model guided by natural-language annotations. GenNA is pretrained on a multimodal nucleotide-text corpus spanning 2,221 eukaryotic species and comprising approximately 416 billion characters, and learns the relationships between sequence patterns and functional annotations within a unified autoregressive framework. Systematic evaluations show that, even without explicit supervision from biological rules, GenNA yields distinguishable perplexity scores in response to semantic mismatches between sequences and functional annotations, to different mutation types, and to perturbations of species labels. Moreover, across a range of natural-language-guided nucleotide generation tasks, the model produces sequences consistent with both target semantics and species context. Overall, GenNA provides a unified framework for natural-language-guided nucleotide modeling and conditional generation, and offers a feasible route toward integrating high-level functional descriptions with low-level sequence design.
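The perplexity comparison mentioned in the abstract can be illustrated with a minimal sketch. It assumes per-token natural-log probabilities are already available from some autoregressive model; the function name `perplexity` and the numeric values are hypothetical placeholders, not GenNA outputs or the authors' evaluation code.

```python
import math


def perplexity(token_logprobs):
    """Return exp of the negative mean natural-log probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Hypothetical values: the same nucleotide sequence scored under a matching
# annotation and under a mismatched annotation (illustrative numbers only).
matched = [math.log(0.5)] * 4
mismatched = [math.log(0.25)] * 4

ppl_matched = perplexity(matched)
ppl_mismatched = perplexity(mismatched)
```

Under this reading, a higher perplexity for the mismatched pairing than for the matched one is the kind of separation the abstract describes as "distinguishable".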
