Precise CDR Position Control in Antibody Sequence Generation Using Conditional Deep Generative Models
Abstract
Controllable full-length antibody sequence generation requires explicit localization of complementarity-determining regions (CDRs), but most autoregressive pipelines optimize global likelihood without machine-readable boundary guarantees. We formulate CDR position control as a sequence modeling objective by inserting explicit CDR1/2/3 boundary tokens and optional property-conditioning tokens. On top of this representation, we introduce CDBO (CDR Boundary-Order constrained decoding), which enforces legal boundary-token progression during decoding, and an auxiliary training objective, L = L_lm + λ1 L_boundary + λ2 L_property, which adds boundary and condition supervision to the language-modeling loss. The full workflow is recomputed from source repertoire data with deterministic quality control, marker-insertion validation, generated-sequence composition analysis, and leakage auditing. Of 11,228,600 raw records, 11,078,824 pass quality control (98.67% retention), and marker-insertion validation reaches 100.00% success on 5,000 samples. In a three-seed, four-way ablation, the Full model (Aux + CDBO) achieves the highest CDR boundary-order fidelity (0.9333 ± 0.0764) versus Base (0.3500 ± 0.0000), while maintaining strong sequence validity. These results support explicit boundary-aware control as a practical route to reproducible and biologically aligned antibody generation.
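To make the idea of enforcing legal boundary-token progression concrete, the following is a minimal sketch of one way such a constraint could be applied at each decoding step. It is an illustration under stated assumptions, not the authors' CDBO implementation: the token names (`<CDR1>`, `</CDR1>`, and so on), the strict open/close ordering, and the helper names `allowed_boundary`, `mask_scores`, and `has_valid_boundary_order` are all hypothetical, since the abstract does not specify them.

```python
# Hypothetical boundary-token vocabulary and legal order.
# The source does not give the actual token names; these are assumptions.
BOUNDARY_ORDER = ["<CDR1>", "</CDR1>", "<CDR2>", "</CDR2>", "<CDR3>", "</CDR3>"]
BOUNDARY_SET = set(BOUNDARY_ORDER)


def allowed_boundary(prefix):
    """Return the only boundary token that may legally come next, or None.

    Assumes the prefix was itself produced under the constraint, so the
    number of boundary tokens seen so far identifies the next legal one.
    """
    n_seen = sum(1 for tok in prefix if tok in BOUNDARY_SET)
    return BOUNDARY_ORDER[n_seen] if n_seen < len(BOUNDARY_ORDER) else None


def mask_scores(prefix, scores):
    """Set the scores of out-of-order boundary tokens to -inf.

    `scores` maps candidate next tokens to model scores (e.g. logits).
    Non-boundary tokens such as amino-acid residues are left unchanged.
    """
    legal = allowed_boundary(prefix)
    return {
        tok: (s if tok not in BOUNDARY_SET or tok == legal else float("-inf"))
        for tok, s in scores.items()
    }


def has_valid_boundary_order(tokens):
    """Check that a finished sequence contains all six boundaries in order."""
    return [tok for tok in tokens if tok in BOUNDARY_SET] == BOUNDARY_ORDER


# Usage: after opening CDR1, the only boundary token allowed is its close.
prefix = ["E", "V", "Q", "<CDR1>", "G", "F"]
masked = mask_scores(prefix, {"A": 0.1, "<CDR2>": 2.0, "</CDR1>": 0.5})
```

In this sketch the mask is applied to the model's scores before sampling or taking the argmax, so an out-of-order boundary token can never be selected. A check like `has_valid_boundary_order` is one plausible way to score boundary-order fidelity on generated sequences, though the metric reported in the abstract may be defined differently.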