A transfer-learning approach to predict antigen immunogenicity and T-cell receptor specificity

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    In this important work, the authors present a sequence-based approach using transfer learning and Restricted Boltzmann Machines to predict antigen immunogenicity and specificity. The evidence and methodology are compelling. This work should be of interest to immunologists, computational biologists, and biophysicists.

Abstract

Antigen immunogenicity and the specificity of binding of T-cell receptors to antigens are key properties underlying effective immune responses. Here we propose diffRBM, an approach based on transfer learning and Restricted Boltzmann Machines, to build sequence-based predictive models of these properties. DiffRBM is designed to learn the distinctive patterns in amino-acid composition that, on the one hand, underlie the antigen’s probability of triggering a response, and on the other hand the T-cell receptor’s ability to bind to a given antigen. We show that the patterns learnt by diffRBM allow us to predict putative contact sites of the antigen-receptor complex. We also discriminate immunogenic and non-immunogenic antigens, antigen-specific and generic receptors, reaching performances that compare favorably to existing sequence-based predictors of antigen immunogenicity and T-cell receptor specificity.

Article activity feed

  1. eLife assessment

    In this important work, the authors present a sequence-based approach using transfer learning and Restricted Boltzmann Machines to predict antigen immunogenicity and specificity. The evidence and methodology are compelling. This work should be of interest to immunologists, computational biologists, and biophysicists.

  2. Reviewer #1 (Public Review):

    In this work, the authors propose a "transfer learning" approach for modeling the properties of sequences that are selected from larger sequence pools on the basis of biophysical or functional properties, where the source populations may themselves be biased in composition. Examples include the set of immunogenic peptides, considered as a subset of all HLA-presented peptides, or the set of TCRs that are specific for a given peptide epitope, as selected from within the much larger pool of all peripheral TCRs. The motivation for transfer learning is that there may only be small numbers of selected sequences available for training and many more examples of the background sequences. Rather than directly fitting a single model on the selected sequences, the idea is to first fit a background model that captures the properties of the source/background population of sequences, using the many examples available for training, and then train a "differential" model that specifically seeks to capture the differences between the selected and background populations. This differential model is trained using the subset of selected sequences, by optimizing their likelihood under a composite model that combines the background model (whose parameters are frozen) and the differential model. The specific architecture used here is the "restricted Boltzmann machine" (RBM), which can be thought of as a generalization of the position-weight matrix approach that can capture pairwise and higher-order interactions between positions. The applications are the two mentioned above: prediction of immunogenic peptides and prediction of TCRs specific for a given peptide-MHC epitope. This work builds on previous work by the authors applying the RBM architecture to peptide-MHC binding [Bravi et al., 2021b] and T-cell responses [Bravi et al., 2021a]. The advance here is in formalizing the "differential" framework and testing immunogenicity prediction and epitope specificity.

    Considering the field and the current state of the art, the main contributions of the manuscript appear to be theoretical/conceptual, in introducing the "diffRBM" method and providing a range of evaluations of its performance, for example, the use of contact prediction to assess the model. For TCR-epitope prediction, the method does not appear to improve on methods like TCRex or TCRdist, though an advantage is that the parameters may be more interpretable than those of some black-box machine learning approaches. Also for epitope prediction, as noted by the authors, the model may be learning features that differentiate TCRs expressed by CD8+ T cells from the background of all TCRs (which is probably weighted toward CD4+ T cells). This would explain the poorer performance in discriminating TCRs specific for one MHC class I epitope from those specific for a different class I epitope. For immunogenicity prediction, evaluations are so dependent on the specifics of the datasets, and the feature itself is so murky, that it's hard to say whether there is a performance advance here.
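
    The two-step scheme described above (fit a background model, freeze it, then fit a differential part on the selected sequences) can be illustrated with a deliberately simplified sketch. It assumes an independent-site, position-weight-matrix model rather than the RBM used by the authors, because in that case the differential fields have a closed-form maximum-likelihood solution. All function names and the toy sequences are hypothetical and are not taken from the authors' software.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(AA)}


def site_frequencies(seqs, pseudocount=1.0):
    """Per-position amino-acid frequencies, regularized with a pseudocount."""
    length = len(seqs[0])
    counts = np.full((length, len(AA)), pseudocount)
    for s in seqs:
        for i, a in enumerate(s):
            counts[i, AA_INDEX[a]] += 1.0
    return counts / counts.sum(axis=1, keepdims=True)


def fit_background(background_seqs):
    """Step 1: background model (log-frequencies), frozen afterwards."""
    return np.log(site_frequencies(background_seqs))


def fit_differential(selected_seqs, log_bg):
    """Step 2: differential fields fitted on the selected sequences.

    For independent sites, maximizing the likelihood of the selected
    sequences under P(s) ~ P_bg(s) * exp(sum_i h_i(s_i)), with P_bg fixed,
    gives h_i(a) = log f_selected_i(a) - log p_bg_i(a).
    """
    return np.log(site_frequencies(selected_seqs)) - log_bg


def differential_score(seq, diff_fields):
    """Score of a sequence under the differential part only."""
    return float(sum(diff_fields[i, AA_INDEX[a]] for i, a in enumerate(seq)))


# Toy data (assumed): many background sequences, few selected ones.
background = ["ACDE", "ACDF", "GCDE", "ACHE"]
selected = ["AWDE", "AWDF"]

log_bg = fit_background(background)
diff = fit_differential(selected, log_bg)
print(differential_score("AWDE", diff), differential_score("ACDE", diff))
```

    In this toy setting the differential score is higher for sequences carrying the residue enriched in the selected set. An RBM-based differential model would additionally capture dependencies between positions, which this independent-site sketch cannot.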

    One nice feature of the diffRBM model is that scores ("single-site factors") can be assigned to individual amino acids in a peptide (or TCR) sequence that capture the contribution of each amino acid at its position to the overall score of the sequence, taking into account the sequence context. The authors show that these single-site factors, for the diffRBM model trained on immunogenic peptides, highlight positions that tend to be involved in TCR contacts as well as specific amino acids, such as "W at position 5", that have been found in previous studies to enhance TCR recognition. The single-site factors for a diffRBM model trained on epitope-specific TCRs appear to do a reasonable job of predicting CDR3 positions that contact the peptide.
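
    As a rough illustration of a context-dependent per-residue score, the sketch below assumes one plausible definition: the score of the full sequence minus the mean score over all single substitutions at that position. This definition, the function names, the toy scoring function, and the example peptide are assumptions made for illustration; they are not necessarily the authors' formula or model.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"


def single_site_factors(seq, score_fn):
    """Assumed definition: score(seq) minus the mean score over all
    single-residue substitutions at each position."""
    full = score_fn(seq)
    factors = []
    for i in range(len(seq)):
        alternatives = [
            score_fn(seq[:i] + a + seq[i + 1:]) for a in AA if a != seq[i]
        ]
        factors.append(full - float(np.mean(alternatives)))
    return factors


def toy_score(seq):
    """Hypothetical score: rewards W at position 5 (1-based), and rewards it
    more when position 4 is hydrophobic, so the score depends on context."""
    score = 0.0
    if len(seq) >= 5 and seq[4] == "W":
        score += 1.0
        if seq[3] in "ILVFM":
            score += 0.5
    return score


peptide = "GILFWVFTL"  # made-up 9-mer used only for this example
factors = single_site_factors(peptide, toy_score)
for position, (residue, factor) in enumerate(zip(peptide, factors), start=1):
    print(position, residue, round(factor, 3))
```

    With this toy score, position 5 receives the largest factor, position 4 receives a smaller positive factor because it modulates the effect of W, and all other positions receive zero.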

    Overall, the conclusions of the study are well-supported and the descriptions of the method's performance are balanced. The manuscript is well-written, and the supporting information nicely addresses minor questions that come up in reading the main text. One minor quibble I have is with the description of the method as "unsupervised", especially in the TCR-epitope prediction setting, since the sequences provided to the diffRBM for training, between which the model is tasked with learning differences, are exactly the positive and negative sequences for the AUROC calculations (up to train/test sampling). It is also confusing to me that the overall selection factors for TCR-epitope binding are so very modest (0.19 for Flu M1_58, for example; see Figure S20D, where this is the "effective fraction of sequences retained in selected data compared to background ones"). This doesn't seem like it can be correct, given how focused some of these epitope-specific repertoires are. Overall, though, the study and associated software tools are likely to be useful contributions to the field.
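
    For readers less familiar with the AUROC calculations mentioned above, the sketch below computes the AUROC directly from its rank interpretation: the probability that a randomly chosen positive sequence scores higher than a randomly chosen negative one, with ties counted as one half. The function name and the example scores are made up for illustration and do not come from the manuscript.

```python
def auroc(pos_scores, neg_scores):
    """AUROC as the fraction of (positive, negative) pairs in which the
    positive scores higher; ties contribute one half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up model scores for held-out selected (positive) and background
# (negative) sequences.
positives = [0.9, 0.8, 0.4]
negatives = [0.5, 0.3, 0.1]
print(auroc(positives, negatives))
```

    An AUROC of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of the two sets.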

  3. Reviewer #2 (Public Review):

    The work by Bravi et al. introduces a learning technique based on Restricted Boltzmann Machines that uses an approach analogous to differential learning to model two distinct datasets that are part of a common biophysical framework but behave differently depending on a set of parameters with "background" and "selected" features. The biological problem tackled by the authors is the prediction of immunogenic versus non-immunogenic peptides, as well as the determination of the sequence features related to binding recognition.

    My assessment of the strengths and weaknesses of this work is the following:

    Strengths

    The authors propose a novel and technically robust solution to a significant and currently unsolved problem in molecular immunology. They are detailed and exhaustive in the description of the formulation of their model as well as in the assessment analysis. This being a hard problem, the results presented seem a very important step forward, not only in solving some of these problems but also in providing convincing arguments that this methodology is more general than previous approaches: it can be applied to both immunogenicity prediction and binding specificity, and it is generative in nature. This could be of significant use in therapeutic applications. Another strength of this work is that the methodology could easily be applied to other biological problems that deal with general versus selected features, for instance, specificity in the recognition of other protein-protein interactions, protein-RNA recognition, and the analysis of ever-growing SELEX and in vitro evolution datasets. Finally, I thought that the efforts of this work to provide "interpretable" learning models are important and definitely a strength.

    Weaknesses

    As stated before, this work is detailed in nature and contains the technical details needed to make it reproducible. However, in the authors' attempt to compare against the large number of alternative approaches to this model, I felt that the readability of the article is affected. If this article is meant to be read by broader audiences that might utilize this framework in immunology research, at points the manuscript feels lost in comparisons and descriptions of other methods. This is because every time a new technical method is introduced, readers want to know how it compares with other methods, but I feel that the manuscript can be rewritten in such a way that those technical comparisons don't become the major point of the paper and that it focuses more on how the predictive results of the model can then be applied in immunology. A similar point can be raised about the methods section: although it has the advantage of being exhaustive and detailed, this also makes it hard for the reader to focus on the most important parts of the work. Perhaps a better distribution of content between the methods and SI methods could help streamline the readability of this interesting work.

  4. Reviewer #3 (Public Review):

    The authors present in great detail a novel transfer-learning AI model architecture called diffRBM, which is based on the original RBM papers [Hinton, 2002; Hinton and Salakhutdinov, 2006]. They further show how this tool can be used to assess the immunogenicity of TCR positions and the importance of position-specific amino acid usage in creating this immunogenicity. They show that this novel method identifies all known important positions at least as well as existing analytical and structural methods, potentially in a more explanatory way.