PlasmidLM: A Promptable DNA Language Model via Verifiable-Reward Post-Training

McClain Thiel
Chris P. Barnes

Abstract

Generative DNA models are typically next-token completers: they extend a sequence but offer no native interface for telling the model what to make. PlasmidLM is a promptable DNA language model for plasmids. A designer supplies a human-readable component specification, for example a high-copy E. coli vector with kanamycin resistance and an EGFP reporter, and the model generates the corresponding multi-kilobase construct in a single autoregressive pass. Prompts are unordered sets of named-part tokens at the granularity of biological shorthand, not learned latent codes or rigid grammars. We evaluate outputs along two axes: a sequence is viable if it is structurally plausible as a plasmid, and faithful if its detected components match the prompt. Their conjunction is the useful-plasmid rate, the primary metric we report. On a held-out 1,000-prompt benchmark, the post-trained model achieves a useful-plasmid rate of 48.5% at single-shot decoding and 89.7% under best-of-4 sampling. Verifiable-reward post-training with GRPO against a 660-entry sequence motif registry improves the useful-plasmid rate across all sampling budgets. We release the 19.3M-parameter model, evaluation suite, and a paired benchmark of prompt-sequence pairs.
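
The metric described in the abstract can be illustrated with a small sketch. It assumes that a sample counts as useful only when it is both viable and faithful, and that a prompt counts as solved under best-of-k sampling when at least one of its first k samples is useful; the second assumption is one plausible reading and is not confirmed by the abstract. The names `Sample` and `useful_rate_at_k` and the toy data are hypothetical and do not come from the released evaluation suite.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Sample:
    """One generated sequence, reduced to the two evaluation axes."""
    viable: bool    # structurally plausible as a plasmid
    faithful: bool  # detected components match the prompt

    @property
    def useful(self) -> bool:
        # Useful is the conjunction of the two axes.
        return self.viable and self.faithful


def useful_rate_at_k(samples_per_prompt: List[List[Sample]], k: int) -> float:
    """Fraction of prompts with at least one useful sample among the first k.

    Assumption: best-of-k is scored as "any of k samples is useful".
    """
    if not samples_per_prompt:
        raise ValueError("need at least one prompt")
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = sum(
        any(s.useful for s in samples[:k]) for samples in samples_per_prompt
    )
    return hits / len(samples_per_prompt)


# Toy data (made up for illustration): 4 prompts, 2 samples each.
toy = [
    [Sample(True, True), Sample(True, False)],    # useful on the first sample
    [Sample(True, False), Sample(True, True)],    # useful on the second sample
    [Sample(False, True), Sample(False, False)],  # never viable
    [Sample(True, False), Sample(False, True)],   # never both at once
]

rate_1 = useful_rate_at_k(toy, 1)
rate_2 = useful_rate_at_k(toy, 2)
print(f"useful-plasmid rate, single shot: {rate_1:.2f}")
print(f"useful-plasmid rate, best-of-2:  {rate_2:.2f}")
```

The last toy prompt shows why the conjunction matters: one sample is viable and another is faithful, but neither is both, so the prompt does not count as solved.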

Version published to 10.64898/2026.05.19.725242 on bioRxiv
May 21, 2026

