H3BERTa: A CDR-H3 specific language model for antibody repertoire analysis

Abstract

Antibodies are central to immune defense and therapeutic design, yet predicting which sequences confer functional activity remains challenging. Deep learning models trained on full variable regions often struggle due to sparse experimental data, signal dilution from conserved framework residues, and the extreme diversity of hypervariable loops. The heavy-chain complementarity-determining region 3 (CDR-H3) is the most variable segment shaping antigen specificity and driving immune diversity. Here, we present H3BERTa, a transformer-based language model trained solely on CDR-H3 sequences, to test whether this short region alone encodes enough biologically meaningful information. H3BERTa embeddings recapitulate biologically relevant sequence features, including J-gene usage and inferred B-cell maturation state. We further show that pseudo-perplexity profiles can be used to analyze repertoires, distinguishing healthy from HIV-1–derived sequences and suggesting measurable immune response signatures. Finally, these embeddings can support classifiers for broadly neutralizing antibodies (bnAbs) using limited labeled sequences, demonstrating their potential for accelerating antibody discovery. Together, our results indicate that the CDR-H3 region alone encodes a rich immunological signature, which H3BERTa robustly captures, providing a focused computational tool for analyzing repertoire diversity and informing antibody engineering.
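
To make the pseudo-perplexity idea concrete, the following is a minimal sketch of how such a score is commonly computed for a masked language model: each residue is masked in turn, the model's probability for the true residue is recorded, and the exponentiated mean negative log-probability is returned. This is an illustration under stated assumptions, not the authors' implementation. The names `pseudo_perplexity`, `uniform_model`, and the `#` mask symbol are hypothetical, and the toy uniform model only stands in for a trained model such as H3BERTa.

```python
import math
from typing import Callable, Dict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids
MASK = "#"  # hypothetical mask symbol, chosen only for this sketch


def pseudo_perplexity(
    seq: str,
    masked_probs: Callable[[str, int], Dict[str, float]],
) -> float:
    """Exponentiated mean negative log-probability of each residue,
    where every position is masked in turn and scored by the model."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    total_log_prob = 0.0
    for i, residue in enumerate(seq):
        # Hide position i, then ask the model for a distribution over residues.
        masked = seq[:i] + MASK + seq[i + 1:]
        probs = masked_probs(masked, i)
        total_log_prob += math.log(probs[residue])
    return math.exp(-total_log_prob / len(seq))


def uniform_model(masked_seq: str, position: int) -> Dict[str, float]:
    """Toy stand-in for a trained masked language model (assumption):
    assigns equal probability to every amino acid at every position."""
    p = 1.0 / len(AMINO_ACIDS)
    return {aa: p for aa in AMINO_ACIDS}


if __name__ == "__main__":
    # Example CDR-H3-like string, made up for illustration.
    print(pseudo_perplexity("CARDYW", uniform_model))
```

With the uniform stand-in, every sequence scores the alphabet size, which is the uninformed baseline; a trained model would be expected to assign lower values to sequences that resemble its training distribution.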
