Protein language model pseudolikelihoods capture features of in vivo B cell selection and evolution
Abstract
B cell selection and evolution play crucial roles in dictating successful immune responses. Recent advancements in sequencing technologies and deep-learning strategies have paved the way for generating and exploiting an ever-growing wealth of antibody repertoire data. Self-supervised protein language models (PLMs) have demonstrated the ability to learn complex representations of antibody sequences and have been leveraged for a wide range of applications, including diagnostics, structural modeling, and antigen-specificity prediction. PLM-derived likelihoods have been used to improve antibody affinities in vitro, raising the question of whether PLMs can capture and predict features of B cell selection in vivo. Here, we explore how sequence pseudolikelihoods (SPs) generated by general and antibody-specific PLMs relate to features of in vivo B cell selection such as expansion, isotype usage, and somatic hypermutation (SHM) at single-cell resolution. Our results demonstrate that the type of PLM and the region of the antibody input sequence significantly affect the generated SP. Contrary to previous in vitro reports, we observe a negative correlation between SPs and binding affinity, whereas repertoire features such as SHM, isotype usage, and antigen specificity are strongly correlated with SPs. By constructing evolutionary lineage trees of B cell clones from human and mouse repertoires, we observe that SHMs are routinely among the most likely mutations suggested by PLMs and that mutating residues have lower absolute likelihoods than conserved residues. Our findings highlight the potential of PLMs to predict features of antibody selection and further suggest their potential to assist in antibody discovery and engineering.
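To make the term concrete, the sketch below shows one common way a sequence pseudolikelihood can be computed with a masked language model: mask each residue in turn, read off the probability the model assigns to the true residue, and average the log-probabilities. This is an illustration under stated assumptions, not necessarily the exact scoring used in this work. The `MaskedPredictor` interface, the `MASK` symbol, and `toy_predictor` are hypothetical stand-ins for a real PLM, and the probabilities are made up.

```python
import math
from typing import Callable, Dict

MASK = "#"  # hypothetical mask symbol; real PLMs use their own mask token
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical interface: given a sequence with one masked position,
# return a probability for every amino acid at that position.
MaskedPredictor = Callable[[str, int], Dict[str, float]]


def pseudo_log_likelihood(seq: str, predict: MaskedPredictor) -> float:
    """Average log-probability of each true residue when it is masked in turn."""
    total = 0.0
    for i, residue in enumerate(seq):
        masked = seq[:i] + MASK + seq[i + 1:]
        probs = predict(masked, i)
        total += math.log(probs[residue])
    return total / len(seq)


def toy_predictor(masked_seq: str, position: int) -> Dict[str, float]:
    """Stand-in for a PLM: ignores context and always favors alanine.

    Assigns 0.62 to 'A' and 0.02 to each of the other 19 residues
    (made-up numbers that sum to 1).
    """
    return {aa: (0.62 if aa == "A" else 0.02) for aa in AMINO_ACIDS}


# A sequence the toy model "expects" scores higher than one it does not.
sp_expected = pseudo_log_likelihood("AAAA", toy_predictor)
sp_unexpected = pseudo_log_likelihood("WWWW", toy_predictor)
print(sp_expected > sp_unexpected)
```

Averaging over positions keeps scores comparable between inputs of different length, which matters when comparing a short region such as the CDR3 with a full V(D)J sequence.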
Key points
In contrast to previous in vitro work (Hie et al., 2024), we observe a negative correlation between PLM-generated sequence pseudolikelihood (SP) and binding affinity. This contrast can be explained by the inherent antibody germline bias introduced by PLM training data and by the difference between in vivo and in vitro settings.
Our findings also reveal a considerable correlation between SPs and repertoire features such as the V-gene family, isotype, and the amount of somatic hypermutation (SHM). Moreover, labeled antigen-binding data suggested that SP is consistent with antigen-specificity and binding affinity.
By reconstructing B cell lineage evolutionary trajectories, we detected predictable features of SHM using PLMs. We observe that SHMs are routinely among the most likely mutations suggested by PLMs and that mutating residues have lower absolute likelihoods than conserved residues.
We demonstrate that both the region of the antibody sequence (CDR3 or full V(D)J) provided as input to the model and the type of PLM used influence the resulting SPs.
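The statement that SHMs are "among the most likely mutations suggested by PLMs" can be read as a ranking question: at a mutated position, where does the observed substitution fall among all substitutions the model could propose? The sketch below illustrates that reading under stated assumptions. The function `mutation_rank` and the probabilities in `example_probs` are hypothetical and made up for illustration; in practice the distribution would come from a PLM evaluated on the parent sequence in a lineage tree.

```python
from typing import Dict


def mutation_rank(probs: Dict[str, float], parent: str, mutant: str) -> int:
    """1-based rank of `mutant` among all substitutions away from `parent`,
    ordered by descending model probability."""
    candidates = sorted(
        (aa for aa in probs if aa != parent),  # exclude "no mutation"
        key=lambda aa: probs[aa],
        reverse=True,
    )
    return candidates.index(mutant) + 1


# Made-up model output at one position of a parent sequence (illustrative only).
example_probs = {"S": 0.50, "N": 0.25, "T": 0.15, "G": 0.10}

# Suppose the parent residue is S and the child in the lineage tree carries N.
rank = mutation_rank(example_probs, parent="S", mutant="N")
print(rank)
```

A rank near 1 means the observed mutation is one of the model's preferred substitutions. The probability assigned to the parent residue itself (0.50 in this toy case) is the kind of quantity that would be compared between mutating and conserved positions.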