CDR-aware masked language models for paired antibodies enable state-of-the-art binding prediction
Abstract
Antibodies are a leading class of biologics, yet their architecture, with conserved framework regions and hypervariable complementarity-determining regions (CDRs), poses unique challenges for computational modeling. We present a region-aware pretraining strategy for paired heavy (VH) and light (VL) variable-domain sequences, built on the ESM2-3B and ESM C (600M) protein language models. We compare three masking strategies: whole-chain, CDR-focused, and a hybrid of the two. Through evaluation on binding-affinity datasets spanning single-mutant panels and combinatorial mutants, we demonstrate that CDR-focused training produces superior embeddings for functional prediction. Notably, training only on paired VH-VL sequences proves sufficient: massive unpaired pretraining provides no measurable downstream benefit. Our compact 600M ESM C model achieves state-of-the-art performance, matching or exceeding larger antibody-specific baselines. These findings establish a principled framework for antibody language models: prioritize paired sequences with CDR-aware supervision over scale and complex training curricula to achieve both computational efficiency and predictive accuracy.
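For concreteness, the sketch below shows one way CDR-focused masking could be implemented for masked-LM pretraining. It assumes per-residue CDR annotations (e.g., derived from IMGT numbering) are available as a boolean mask over the concatenated VH-VL sequence; the helper name `cdr_focused_mask`, the 15% mask rate, and the BERT-style 80/10/10 corruption scheme are illustrative assumptions, not the paper's confirmed recipe.

```python
# Illustrative sketch of CDR-focused masking for MLM pretraining.
# Hypothetical helper; the authors' exact recipe may differ.
import torch

def cdr_focused_mask(input_ids: torch.Tensor,
                     cdr_mask: torch.Tensor,
                     mask_token_id: int,
                     vocab_size: int,
                     mask_rate: float = 0.15):
    """Corrupt tokens for masked-LM training, sampling only CDR positions.

    input_ids: (L,) token ids for the concatenated VH-VL sequence.
    cdr_mask:  (L,) bool tensor, True at CDR residue positions.
    Returns (corrupted_ids, labels); labels are -100 at unmasked positions.
    """
    # Restrict masking probability to CDR residues; framework stays at 0.
    probs = cdr_mask.float() * mask_rate
    masked = torch.bernoulli(probs).bool()

    labels = torch.full_like(input_ids, -100)
    labels[masked] = input_ids[masked]

    corrupted = input_ids.clone()
    r = torch.rand(input_ids.shape)
    # BERT-style corruption: 80% [MASK], 10% random token, 10% unchanged.
    corrupted[masked & (r < 0.8)] = mask_token_id
    swap = masked & (r >= 0.8) & (r < 0.9)
    corrupted[swap] = torch.randint(vocab_size, input_ids.shape)[swap]
    return corrupted, labels
```

Under these assumptions, whole-chain masking is the special case where `cdr_mask` is all-True, and a hybrid strategy could simply alternate between the two regimes across batches.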