SNAC-DB: An ML-Ready Database for Antibody and NANO-BODY® VHH–Antigen Complexes with Expanded Structural Diversity and Real-World Benchmarking

Read the full article See related articles

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Predicting antibody and NANOBODY ® VHH–antigen complexes remains a critical challenge for state-of-the-art structure prediction models, limiting their impact in therapeutic discovery pipelines. We introduce SNAC-DB, an ML-ready database and curation pipeline enriched with structural biology expertise, designed to accelerate model accuracy and generalization by providing 31–37% expanded structural diversity over existing resources like SAbDab through comprehensive re-curation that extracts maximum value from available experimental structures. SNAC-DB expands coverage by capturing often-overlooked complexes and accurately identifying complete multi-chain epitopes through improved biological-assembly-based logic. Built for ML practitioners, SNAC-DB provides standardized formats with multi-threshold structure-based clustering to enable principled sample weighting during training. Using a rigorous benchmark of public PDB entries deposited post-May 2024 plus confidential therapeutic structures, we evaluate seven leading models (Protenix-v1, OpenFold-3p2, RosettaFold-3, Boltz-2, Boltz-1x, Chai-1, and AlphaFold2.3-multimer) with evaluation methodology tailored to antibody/NAN-OBODY ® VHH–antigen complexes to ensure correct handling of multi-chain epitopes, revealing systematic performance gaps: success rates rarely exceed 25%, confidence-based ranking fails to identify best predictions even when accurate structures exist in ensembles, and all models consistently struggle with therapeutically relevant NANOBODY ® VHHs. Systematic evaluation of sampling strategies demonstrates that while generating 1000 samples per target substantially increases the likelihood of producing accurate structures (oracle selection improves from 11.9% to 50.5%), confidence-based ranking remains nearly flat (between 10.9% and 14.9%), revealing that improved ranking mechanisms represent a more tractable path to performance gains. Finally, fine-tuning GeoDock on SNAC-DB yields higher success rates than training on SAbDab (11.0% vs. 7.1% for antibodies; 7.0% vs. 4.0% for NANOBODY ® VHHs), suggesting that SNAC-DB’s expanded structural diversity translates to improved model generalization.

Significance Statement

Computational antibody/NANOBODY ® VHH design shows promise but remains unreliable for therapeutic development. SNAC-DB provides 31–37% expanded structural diversity through comprehensive data curation, immediately accelerating model development. Benchmarking seven leading AI models reveals accuracy rarely exceeds 25% on therapeutic targets, with confidence-based ranking failing to identify correct structures even when they exist in model outputs. Training on SNAC-DB increases prediction accuracy, validating that high-quality, diverse training data is critical for advancing computational methods toward clinical impact.

Article activity feed