Fine-tuned Protein Language Model Identifies Antigen-specific B Cell Receptors from Immune Repertoires
Abstract
Scalable identification of antigen-specific antibodies from whole immune repertoire V(D)J sequences is a central challenge in biomedical engineering. We show that protein language models (PLMs) fine-tuned on antibody heavy-chain sequences can directly predict antigen specificity from unselected immune repertoires. We assessed our model, Antigen Specificity Predictor (ASPred), against SARS-CoV-2, influenza, and HIV antigens, observing comparable predictive performance across all three. From the whole immune repertoire V(D)J sequences of mice immunized with the SARS-CoV-2 spike protein receptor-binding domain (RBD), ASPred identified antibody sequences specific to RBD. Several candidate sequences were validated, including one expressed as a heavy chain-only nanobody with a dissociation constant of 20.7 nM. Molecular dynamics simulations supported the predicted interactions at both coarse-grained and atomic levels. Benchmarking against Barcode-Enabled Antigen Mapping (BEAM) of B cell receptor sequence data showed highly significant overlap with ASPred predictions, supporting the method's scalability. The predicted SARS-CoV-2 binders differed substantially from training sequences, demonstrating generalization beyond sequence memorization. Together, these results establish that antibody heavy-chain sequences encode sufficient information for PLMs to infer specificity, offering a scalable framework for antibody discovery with broad applications.
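The core setup described above, a language model over heavy-chain amino acid sequences topped with a binary antigen-specificity head, can be sketched as follows. This is an illustrative toy, not the authors' ASPred code: the tiny transformer, vocabulary, sequence length, and all hyperparameters are placeholder assumptions, and a real implementation would start from a pretrained PLM checkpoint rather than random weights.

```python
# Hypothetical sketch of a PLM-style antigen-specificity classifier.
# Architecture and sizes are illustrative assumptions, not values from the paper.
import torch
import torch.nn as nn

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
PAD, CLS = 0, 1                                   # special token ids
VOCAB = {aa: i + 2 for i, aa in enumerate(AMINO_ACIDS)}

def tokenize(seq: str, max_len: int = 128) -> torch.Tensor:
    """Map a heavy-chain amino acid string to fixed-length token ids."""
    ids = [CLS] + [VOCAB[a] for a in seq[: max_len - 1]]
    ids += [PAD] * (max_len - len(ids))
    return torch.tensor(ids)

class SpecificityClassifier(nn.Module):
    """Small transformer encoder + linear head producing a binding logit."""
    def __init__(self, d_model=64, nhead=4, nlayers=2, max_len=128):
        super().__init__()
        self.embed = nn.Embedding(len(VOCAB) + 2, d_model, padding_idx=PAD)
        self.pos = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(d_model, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        positions = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(positions)
        x = self.encoder(x, src_key_padding_mask=tokens.eq(PAD))
        return self.head(x[:, 0]).squeeze(-1)     # logit from the CLS position

model = SpecificityClassifier()
model.eval()                                      # disable dropout for inference
batch = torch.stack([tokenize("EVQLVESGGGLVQPGG"),
                     tokenize("QVQLQQSGAELARPGA")])
with torch.no_grad():
    logits = model(batch)
probs = torch.sigmoid(logits)                     # per-sequence P(binds antigen)
```

In a fine-tuning workflow, the encoder weights would be initialized from a pretrained protein language model and the whole stack trained with a binary cross-entropy loss on labeled binder/non-binder heavy-chain sequences; at inference time every sequence in an unselected repertoire is scored and the top-ranked candidates are carried forward to experimental validation.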