Data-optimal scaling of paired antibody language models
Abstract
Scaling laws for large language models in natural language domains are typically derived under the assumption that performance is primarily compute-constrained. In contrast, antibody language models (AbLMs) trained on paired sequences are primarily data-limited and thus require different scaling considerations. To explore how model size and data scale affect AbLM performance, we trained 15 AbLMs across all pairwise combinations of five model sizes and three training data sizes. From these experiments, we derive an AbLM-specific scaling law and estimate that training a data-optimal AbLM equivalent of the highly performant 650M-parameter ESM-2 protein language model would require ∼5.5 million paired antibody sequences. Evaluation on multiple downstream classification tasks revealed that significant performance gains emerged only at sufficiently large model sizes, suggesting that in data-limited domains, improved performance depends jointly on model scale and data volume.
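To make the scaling-law fit concrete, the sketch below shows one way such an analysis could be set up, assuming a Chinchilla-style parametric loss L(N, D) = E + A/N^α + B/D^β fitted to a grid of model sizes and paired-sequence counts. All numerical values (the size grid, the synthetic losses, and the resulting estimate) are placeholders for illustration only and are not the data or coefficients reported in this work; `numpy` and `scipy` are assumed to be available.

```python
import numpy as np
from scipy.optimize import curve_fit

# Hypothetical training grid: five model sizes crossed with three data sizes.
# These are placeholder values, NOT the measurements from this study.
N = np.array([8e6, 35e6, 150e6, 350e6, 650e6] * 3)        # model parameters
D = np.repeat([1.2e5, 3.7e5, 1.6e6], 5)                    # paired sequences
rng = np.random.default_rng(0)
L = 1.8 + 120.0 / N**0.34 + 85.0 / D**0.28 + rng.normal(0, 0.01, 15)  # synthetic losses

# Chinchilla-style parametric loss: L(N, D) = E + A / N**alpha + B / D**beta.
# Fitting log(A) and log(B) keeps the coefficients positive during optimization.
def loss(ND, logA, logB, alpha, beta, E):
    n, d = ND
    return E + np.exp(logA) / n**alpha + np.exp(logB) / d**beta

p0 = [np.log(100.0), np.log(100.0), 0.3, 0.3, 1.5]
popt, _ = curve_fit(loss, (N, D), L, p0=p0, maxfev=20000)
logA, logB, alpha, beta, E = popt
A, B = np.exp(logA), np.exp(logB)

# Compute-optimal allocation for a budget C ~ 6*N*D (Hoffmann et al., 2022):
#   N_opt(C) = G * (C/6)**(beta/(alpha+beta))
#   D_opt(C) = (1/G) * (C/6)**(alpha/(alpha+beta))
#   with G = (alpha*A / (beta*B))**(1 / (alpha+beta))
G = (alpha * A / (beta * B)) ** (1.0 / (alpha + beta))
a_exp = beta / (alpha + beta)
b_exp = alpha / (alpha + beta)

# Invert N_opt(C) = 650e6 to find the budget at which a 650M-parameter model is
# optimal, then read off the matching data requirement D_opt(C).
target_N = 650e6
C = 6.0 * (target_N / G) ** (1.0 / a_exp)
D_opt = (1.0 / G) * (C / 6.0) ** b_exp
print(f"Data-optimal training set for a {target_N:.0e}-parameter model: ~{D_opt:.2e} sequences")
```

Under these placeholder assumptions the script simply recovers the generating exponents and prints an arbitrary estimate; the point is the procedure, in which fitting the parametric loss surface over the model-size/data-size grid yields a closed-form data-optimal training-set size for any target model scale.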