Beyond Structure and Affinity: Context-Dependent Signals for de novo Binder Success
Abstract
De novo protein binder design has advanced rapidly, yet most designs fail experimentally and current structure- and affinity-centred evaluation does not reliably predict which candidates will succeed. Here we show that biology-informed sequence features, derived from models trained on natural proteins, identify transferable and context-dependent associations with binder expression and binding that are not captured by structural scoring alone.
We re-analysed two public benchmarks—the Bits to Binders CAR-T CD20 competition (11,984 designs; expression, proliferation, and T cell function gates) and the Adaptyv EGFR competition (603 designs; expression and binding affinity)—using five biology-informed ML models predicting disorder, amyloidogenicity, topology, PTM sites, and protein classification. Every feature was tested at each gate with FDR-corrected statistics.
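The per-gate testing described above involves many feature-by-gate comparisons, which is why a false discovery rate (FDR) correction is needed. The abstract does not say which FDR procedure was used; the sketch below assumes the common Benjamini-Hochberg step-up adjustment, and the function name and example p-values are purely illustrative rather than taken from the study.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order.

    Assumption: BH is used here as a stand-in for the unspecified
    FDR correction mentioned in the text.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the minimum
    # of p * m / rank over all ranks at or above its own (monotonicity).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        qvalues[idx] = running_min
    return qvalues


# Illustrative p-values for four hypothetical features at one gate.
example_pvalues = [0.01, 0.04, 0.03, 0.20]
example_qvalues = benjamini_hochberg(example_pvalues)
# Features with q below the chosen threshold (0.05 here, also an assumption).
significant = [q < 0.05 for q in example_qvalues]
```

In this toy input only the first feature stays below a 0.05 threshold after adjustment, even though three raw p-values were below 0.05, which illustrates why uncorrected per-gate tests would overstate the number of associations.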
We identify three layers of signal. Transferable: lower aggregation propensity is the most robust cross-benchmark signal; PTM-site density recurs univariately but is partly length-confounded in EGFR. Architecture-dependent: topology, disorder, and disulfide-related descriptors are significant in both datasets but flip direction, consistent with the different requirements of CAR extracellular domains versus standalone binders. Context-specific: phosphorylation-related associations with CAR-T depletion and low-disorder dominance in EGFR binding are tied to individual assay or format contexts. In the CAR-T benchmark, stacking biology-informed filters raises the enrichment hit rate from 13.8% to 38.6% (2.8× lift) after controlling for known sequence-level predictors.
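The lift figure is the ratio of the hit rate among designs that pass every stacked filter to the baseline hit rate across all designs. The following minimal sketch shows that arithmetic on a made-up toy set; the field names (`agg`, `dis`, `hit`), the thresholds, and the eight toy designs are hypothetical and are not the study's features or data. It also omits the adjustment for known sequence-level predictors that the reported figure includes.

```python
def hit_rate(designs, filters=()):
    """Fraction of hits among designs that pass every filter.

    Returns (rate, n_kept); rate is None when no design passes.
    """
    kept = [d for d in designs if all(f(d) for f in filters)]
    if not kept:
        return None, 0
    return sum(1 for d in kept if d["hit"]) / len(kept), len(kept)


# Toy designs with hypothetical aggregation ("agg") and disorder ("dis") scores.
toy_designs = [
    {"agg": 0.2, "dis": 0.1, "hit": True},
    {"agg": 0.8, "dis": 0.1, "hit": False},
    {"agg": 0.3, "dis": 0.6, "hit": False},
    {"agg": 0.1, "dis": 0.2, "hit": True},
    {"agg": 0.9, "dis": 0.7, "hit": False},
    {"agg": 0.4, "dis": 0.3, "hit": False},
    {"agg": 0.7, "dis": 0.2, "hit": True},
    {"agg": 0.6, "dis": 0.5, "hit": False},
]

# Hypothetical stacked filters: keep low-aggregation, low-disorder designs.
stacked_filters = [
    lambda d: d["agg"] < 0.5,
    lambda d: d["dis"] < 0.5,
]

baseline_rate, n_all = hit_rate(toy_designs)
filtered_rate, n_kept = hit_rate(toy_designs, stacked_filters)
toy_lift = filtered_rate / baseline_rate

# The same ratio applied to the percentages quoted in the text.
reported_lift = 38.6 / 13.8
```

Applied to the quoted percentages, the ratio 38.6 / 13.8 rounds to 2.8, matching the stated lift. Note that stacking filters also shrinks the candidate pool (here from 8 to 3 designs), so a lift is best read alongside the number of designs retained.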
These results suggest that pre-synthesis screening of de novo binders may benefit from being multi-gate and context-aware, using biology-informed sequence descriptors not only to rank candidates but also to help flag likely failure modes earlier and reduce wasted synthesis and testing.