MuseDrift: Navigating Protein Evolutionary Manifolds with Conditional Discrete Diffusion

Abstract

Protein engineering often requires generating variants of a wild-type (WT) sequence while controlling how far they drift in sequence space. Existing generative models support de novo design but offer limited control over WT similarity. We introduce MuseDrift, a conditional discrete diffusion model for WT-anchored, distance-controlled protein generation. Trained on a 38.2M-pair Seed-and-Stratify corpus, MuseDrift combines WT-prefix conditioning with random-order iterative unmasking to enable stable multi-residue generation. Its key feature is a calibrated identity dial: after lightweight calibration, generated sequences match a target WT identity τ within approximately ±0.05 over τ ∈ [0.55, 0.95] on held-out probes. On Mol-Instructions and CAMEO under shared evaluation oracles, MuseDrift is competitive with multimodal and text-conditioned baselines while uniquely providing explicit identity control. At τ = 0.95, it achieves pLDDT scores of 84.97 on Mol-Instructions and 83.14 on CAMEO with only 85M parameters, rivaling much larger 1.8B–2B-parameter models. Evolutionary and FoldX analyses further support biological plausibility and structural stability.
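
To make the identity-dial criterion concrete, the following is a minimal sketch of how one might check whether a generated variant lands within the stated tolerance of a target identity τ. It assumes identity is the fraction of matching positions between equal-length, pre-aligned sequences; the abstract does not specify the exact identity definition, so this is an assumption. The function names `wt_identity` and `within_dial` and the example sequences are hypothetical, not part of MuseDrift.

```python
def wt_identity(wt: str, variant: str) -> float:
    """Fraction of positions at which the variant matches the wild type.

    Assumption: both sequences have equal length and are already aligned
    position by position (no insertions or deletions).
    """
    if len(wt) != len(variant):
        raise ValueError("this sketch assumes equal-length, pre-aligned sequences")
    if not wt:
        raise ValueError("sequences must be non-empty")
    matches = sum(a == b for a, b in zip(wt, variant))
    return matches / len(wt)


def within_dial(wt: str, variant: str, tau: float, tol: float = 0.05) -> bool:
    """True if the variant's WT identity lies within tol of the target tau."""
    return abs(wt_identity(wt, variant) - tau) <= tol


# Hypothetical 20-residue example with a single substitution (A -> G).
wt = "MKTAYIAKQRQISFVKSHFS"
variant = "MKTGYIAKQRQISFVKSHFS"
identity = wt_identity(wt, variant)
hits_high_target = within_dial(wt, variant, tau=0.95)
hits_low_target = within_dial(wt, variant, tau=0.55)
```

Here the variant differs from the WT at 1 of 20 positions, so it satisfies a target of τ = 0.95 but not τ = 0.55. A real evaluation would likely use an alignment-based identity to handle length changes.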
