MuseDrift: Navigating Protein Evolutionary Manifolds with Conditional Discrete Diffusion
Abstract
Protein engineering often requires generating variants of a wild-type (WT) sequence while controlling how far they drift in sequence space. Existing generative models support de novo design but offer limited control over WT similarity. We introduce MuseDrift, a conditional discrete diffusion model for WT-anchored, distance-controlled protein generation. Trained on a 38.2M-pair Seed-and-Stratify corpus, MuseDrift combines WT-prefix conditioning with random-order iterative unmasking to enable stable multi-residue generation. Its key feature is a calibrated identity dial: after lightweight calibration, generated sequences match a target WT identity τ within approximately ±0.05 over τ ∈ [0.55, 0.95] on held-out probes. On Mol-Instructions and CAMEO under shared evaluation oracles, MuseDrift is competitive with multimodal and text-conditioned baselines while uniquely providing explicit identity control. At τ = 0.95, it achieves pLDDT scores of 84.97 on Mol-Instructions and 83.14 on CAMEO with only 85M parameters, rivaling much larger 1.8B–2B models. Evolutionary and FoldX analyses further support biological plausibility and structural stability.
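To make the identity dial concrete, the following minimal sketch shows what a target identity τ means for a generated variant, and what filling a fully masked sequence in random position order looks like. It is not the MuseDrift model. The names `identity`, `toy_unmask`, and `MASK` are hypothetical. Identity is assumed here to be the fraction of matching positions between equal-length sequences, with no alignment. A random substitution rule stands in for the learned denoiser.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
MASK = "#"  # hypothetical mask symbol, not taken from the paper


def identity(wt: str, variant: str) -> float:
    """Fraction of positions where the variant matches the WT.

    Assumption: sequences have equal length and are compared
    position by position, without alignment.
    """
    if len(wt) != len(variant):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(wt, variant))
    return matches / len(wt)


def toy_unmask(wt: str, tau: float, rng: random.Random) -> str:
    """Fill a fully masked sequence one position at a time, in random order.

    A toy stand-in for a learned denoiser: a fixed budget of positions
    receives a random non-WT residue, and every other position copies
    the WT residue, so the result lands on the target identity tau
    up to rounding.
    """
    n = len(wt)
    n_mut = round((1.0 - tau) * n)   # mutation budget implied by tau
    order = list(range(n))
    rng.shuffle(order)               # random unmasking order
    mutate = set(order[:n_mut])      # positions that will differ from WT
    seq = [MASK] * n
    for pos in order:                # one position revealed per step
        if pos in mutate:
            choices = [a for a in AMINO_ACIDS if a != wt[pos]]
            seq[pos] = rng.choice(choices)
        else:
            seq[pos] = wt[pos]
    return "".join(seq)


if __name__ == "__main__":
    rng = random.Random(0)
    wt = "MKTAYIAKQRQISFVKSHFS"  # arbitrary 20-residue example sequence
    variant = toy_unmask(wt, tau=0.8, rng=rng)
    print(variant, identity(wt, variant))
```

In this toy the identity is fixed by construction. A learned model would presumably only approximate the target, which is consistent with the abstract reporting agreement within approximately ±0.05 after calibration.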