DyAb: sequence-based antibody design and property prediction in a low-data regime
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Protein therapeutic design and property prediction are frequently hampered by data scarcity. Here we propose a new model, DyAb, that addresses these issues by leveraging a pair-wise representation to predict differences in protein properties, rather than absolute values. DyAb is built on top of a pre-trained protein language model and achieves a Spearman rank correlation of up to 0.85 on binding affinity prediction across molecules targeting three different antigens (EGFR, IL-6, and an internal target), given as few as 100 training data points. We employ DyAb in two design contexts: as a ranking model to score combinations of known mutations, and combined with a genetic algorithm to generate new sequences. Our method consistently generates novel antibody candidates with high binding rates, including designs that improve on the binding affinity of the lead molecule by more than ten-fold. DyAb represents a powerful tool for engineering therapeutic protein properties in the low-data regimes common in early-stage drug development.
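To illustrate why a pair-wise formulation can help when data are scarce, the sketch below builds difference targets from a handful of labelled sequences. It is a minimal illustration, not the DyAb implementation: the helper `make_pairs`, the toy sequences, and the affinity values are all assumptions made up for this example. The point is only that n labelled examples yield n(n - 1) ordered pairs, each with a difference label.

```python
from itertools import permutations


def make_pairs(sequences, values):
    """Hypothetical helper: build ordered pairs and their property differences.

    Returns a list of ((seq_a, seq_b), value_a - value_b) for every ordered
    pair of distinct training examples.
    """
    indexed = list(zip(sequences, values))
    return [
        ((seq_a, seq_b), val_a - val_b)
        for (seq_a, val_a), (seq_b, val_b) in permutations(indexed, 2)
    ]


# Toy data, invented for illustration (not taken from the paper).
seqs = ["ACDE", "ACDF", "ACGE", "TCDE"]
affinities = [7.0, 7.5, 6.8, 8.1]  # e.g. pKD-like values

pairs = make_pairs(seqs, affinities)
n = len(seqs)

# Four labelled sequences give 4 * 3 = 12 ordered pairs.
print(len(pairs), n * (n - 1))
```

A model trained on such pairs predicts a difference; an absolute estimate for a new sequence could then be recovered, under this assumed setup, by adding the predicted difference to the measured value of a known reference sequence.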
Article activity feed
-
Supplementary Fig. S
what do the different colors signify in the plots?
-
incorporating only mutations found in previously stable sequence
does that mean that the GA will not consider mutations not encountered in the binding affinity datasets?
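One possible reading of the quoted phrase, sketched below, is a mutation operator that can only draw from substitutions already observed in measured variants. Whether DyAb's genetic algorithm works this way is exactly what the comment asks, so this is an assumption for illustration only; `observed_mutations`, `mutate_from_pool`, and the toy sequences are hypothetical, and all sequences are assumed to have equal length.

```python
import random


def observed_mutations(lead, variants):
    """Collect (position, residue) substitutions seen in variants relative to the lead."""
    pool = set()
    for variant in variants:
        for pos, (lead_aa, var_aa) in enumerate(zip(lead, variant)):
            if lead_aa != var_aa:
                pool.add((pos, var_aa))
    return sorted(pool)


def mutate_from_pool(sequence, pool, rng):
    """Apply one substitution drawn uniformly from the observed pool."""
    pos, residue = rng.choice(pool)
    return sequence[:pos] + residue + sequence[pos + 1:]


# Toy example, invented for illustration.
lead = "ACDE"
measured_variants = ["ACDF", "TCDE"]

pool = observed_mutations(lead, measured_variants)
rng = random.Random(0)
child = mutate_from_pool(lead, pool, rng)
print(pool, child)
```

Under this reading, a substitution that never appears in the measured variants can never enter the population, which is the restriction the comment is probing.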
-
Designs express and bind at consistently high rates (> 85%), comparable to that of single point mutants.
it would be interesting to see a naive control, i.e. what is the average expression and binding rate if you just make N point mutations at random?
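The naive control suggested here could be generated along the following lines. This is a minimal sketch under stated assumptions: `random_point_mutant` is a hypothetical helper, the lead sequence is a placeholder rather than any molecule from the paper, and substitutions are drawn uniformly over the 20 standard amino acids at distinct positions.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_point_mutant(sequence, n_mutations, rng):
    """Return `sequence` with `n_mutations` random substitutions.

    Positions are distinct, and each new residue differs from the one it
    replaces, so the result is exactly `n_mutations` edits from the input.
    """
    positions = rng.sample(range(len(sequence)), n_mutations)
    residues = list(sequence)
    for pos in positions:
        choices = [aa for aa in AMINO_ACIDS if aa != residues[pos]]
        residues[pos] = rng.choice(choices)
    return "".join(residues)


rng = random.Random(0)
lead = "EVQLVESGGGLVQPGGSLRLSCAAS"  # placeholder sequence, not the paper's lead
controls = [random_point_mutant(lead, 3, rng) for _ in range(5)]

for seq in controls:
    print(seq)
```

Expressing and testing such controls alongside the model-guided designs, matched for the number of mutations, would give the baseline expression and binding rates the comment asks about.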
-
66 pM, exhibiting a near 50-fold improvement
Very impressive!
-
DyAb performance on the regression task for design sets are shown in Supplementary Fig. S3
from S3a, it looks like DyAb is not very predictive with the lead A dataset, but performs much better on the others even though they have equal/fewer data points. Any idea on why this is?
-