Fine-tuned Protein Language Model Identifies Antigen-specific B Cell Receptors from Immune Repertoires

Abstract

Scalable identification of antigen-specific antibodies from whole immune repertoire V(D)J sequences is a central challenge in biomedical engineering. We show that protein language models (PLMs) fine-tuned on antibody heavy-chain sequences can directly predict antigen specificity from unselected immune repertoires. We assessed our model, Antigen Specificity Predictor (ASPred), against SARS-CoV-2, influenza, and HIV antigens, observing comparable predictive performance across antigens. From the whole immune repertoire V(D)J sequences of mice immunized with the SARS-CoV-2 spike protein receptor-binding domain (RBD), ASPred identified antibody sequences specific to RBD. Several candidate sequences were validated, including one as a heavy chain-only nanobody with a dissociation constant of 20.7 nM. Molecular dynamics simulations supported the predicted interactions at both coarse-grained and atomic levels. Benchmarking against Barcode-Enabled Antigen Mapping (BEAM) of B cell receptor sequence data revealed highly significant overlap with ASPred predictions, suggesting scalability. The predicted SARS-CoV-2 binders differed substantially from training sequences, demonstrating generalization beyond sequence memorization. Together, these results establish that heavy-chain antibody sequences encode sufficient information for PLMs to infer specificity, offering a scalable framework for antibody discovery with broad applications.
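The pipeline described above, encoding each heavy-chain sequence and scoring it with a binary specificity head, can be sketched minimally as follows. This is an illustrative stand-in only: the encoding, the linear head, and the function names are assumptions for exposition, not the authors' ASPred implementation, which fine-tunes a full protein language model rather than using a fixed composition encoding.

```python
import math

# Hypothetical sketch of the prediction step: encode a heavy-chain amino-acid
# sequence, then apply a classification head that outputs a binding probability.
# A real PLM would produce learned contextual embeddings; the 20-dim amino-acid
# composition vector here is a simple placeholder with the same pipeline shape.

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def encode(seq: str) -> list[float]:
    """Map a heavy-chain sequence to a normalized 20-dim composition vector."""
    counts = [0.0] * len(AMINO_ACIDS)
    for aa in seq:
        counts[AA_INDEX[aa]] += 1.0
    total = max(len(seq), 1)
    return [c / total for c in counts]

def specificity_score(seq: str, weights: list[float], bias: float = 0.0) -> float:
    """Linear head over the encoding; a sigmoid yields a probability in (0, 1)."""
    z = bias + sum(w * x for w, x in zip(weights, encode(seq)))
    return 1.0 / (1.0 + math.exp(-z))

# Repertoire-scale use: score every sequence and rank candidates for validation.
def rank_repertoire(seqs: list[str], weights: list[float]) -> list[str]:
    return sorted(seqs, key=lambda s: specificity_score(s, weights), reverse=True)
```

Under this framing, antibody discovery reduces to scoring every V(D)J sequence in an unselected repertoire and carrying the top-ranked candidates forward to experimental validation.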
