Benchmarking Boltz-2 for Screening of Therapeutic Antibody-Antigen Interactions

Alexandra Fieux-Castagnet
Julian Waton
Alina Glukhonemykh
Eric Snow
Roshini Ashokkumar
Jess Fleming
David Champagne
Thomas Devenyns
Alex Peluffo
Chris Anagnostopoulos

Abstract

Protein structure prediction models (such as AlphaFold, Chai, and Boltz) have transformed structural biology and are increasingly explored for drug discovery; however, their utility for large-scale screening of antibody-antigen (AB-AG) interactions remains unclear, particularly for distinguishing true binding pairs from non-binding pairs. To our knowledge, there has been no exhaustive exploration of Boltz-2 inference settings on this high-impact problem. In this paper we describe and implement a novel benchmarking framework intended to accelerate progress in the field. We evaluated Boltz-2 (NVIDIA NIM implementation) on 519 therapeutic monoclonal antibodies from Thera-SAbDab, pairing each antibody with its cognate target and with a randomly assigned non-cognate antigen. The framework systematically captures variability across stochastic seeds while benchmarking different inference settings, including datasets with and without crystallographically resolved antibody structures. Across settings, Boltz-2-derived confidence metrics showed weak, though above-chance, discrimination (0.50 < AUROC < 0.60). Among the evaluated metrics, the minimum interface predicted TM-score across seed-samples (ipTM-min) captured the strongest signal. Additional feature aggregation and multivariate modelling provided little to no improvement. Increasing the number of stochastic predictions yielded front-loaded gains, with diminishing returns beyond ∼15–20 seed-samples, suggesting limited value of extensive sampling in practical workflows. Notably, inference without multiple sequence alignments (MSAs) slightly improved performance on non-crystallized antibodies (ΔAUROC ≈ +0.027) while reducing runtime by ∼8 seconds per prediction compared to shallow MSA settings.
Overall, these results indicate that off-the-shelf confidence metrics from general-purpose structure prediction models may be insufficient for reliable target-antibody screening and highlight the need for task-specific optimization. Modest amounts of sampling can help, but sampling alone does not improve performance substantially, because gains plateau relatively quickly.
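
To make the evaluation idea concrete, here is a minimal sketch of how a per-pair score such as ipTM-min could be aggregated across seed-samples and then scored with AUROC. It is not the authors' code: the function names (`auroc`, `aggregate`), the data layout, and every ipTM value are hypothetical and chosen only for illustration. The AUROC is computed as the fraction of (binder, non-binder) pairs in which the binder scores higher, with ties counted as one half.

```python
from statistics import mean


def auroc(scores, labels):
    """AUROC as the probability that a random positive outscores a random
    negative (ties count 0.5). Labels: 1 = cognate pair, 0 = non-cognate."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def aggregate(pairs, reducer):
    """Collapse the per-seed ipTM values of each pair into one score."""
    return [reducer(p["iptm"]) for p in pairs]


# Hypothetical toy data: one ipTM value per seed-sample for each
# antibody-antigen pair. These numbers are NOT results from the paper.
pairs = [
    {"label": 1, "iptm": [0.62, 0.55, 0.71]},  # cognate
    {"label": 1, "iptm": [0.48, 0.52, 0.40]},  # cognate
    {"label": 0, "iptm": [0.58, 0.45, 0.44]},  # non-cognate
    {"label": 0, "iptm": [0.30, 0.41, 0.38]},  # non-cognate
]
labels = [p["label"] for p in pairs]

iptm_min = aggregate(pairs, min)    # ipTM-min: worst case across seeds
iptm_mean = aggregate(pairs, mean)  # an alternative aggregation

print("AUROC (ipTM-min): ", auroc(iptm_min, labels))
print("AUROC (ipTM-mean):", auroc(iptm_mean, labels))
```

Taking the minimum rewards pairs whose predicted interface stays confident across every seed, which is one plausible reading of why a worst-case aggregate might carry signal; the abstract itself reports only that ipTM-min performed best, not why.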

Version published to 10.64898/2026.05.13.724924 on bioRxiv
May 14, 2026
