Restoring data balance via generative models of T cell receptors for antigen-binding prediction



Abstract

Unveiling specificity in T cell recognition of antigens represents a major step to understand the immune system response. Many supervised machine learning approaches have been designed to build sequence-based predictive models of such specificity using binding and non-binding receptor-antigen data. Due to the scarcity of known specific T cell receptors for each antigen compared to the abundance of non-specific ones, available datasets are heavily imbalanced and make the goal of achieving solid predictive performances very challenging. Here, we propose to restore data balance through data augmentation using generative unsupervised models. We then use these augmented data to train supervised models for prediction of peptide-specific T cell receptors, or binding pairs of peptide and T cell receptor sequences. We show that our pipeline yields increased performance in prediction tasks of T cell receptors specificity. More broadly, our pipeline provides a general framework that could be used to restore balance in other computational problems involving biological sequence data.

Article activity feed

  1. eLife Assessment

    This valuable study introduces a data augmentation approach based on generative unsupervised models to address data imbalance in immune receptor modeling. Support for the findings is solid, showing that the use of generated data increases the performance of downstream supervised prediction tasks, e.g., TCR-peptide interaction prediction. However, the validation, mainly relying on synthetic data, could be more complete, especially regarding unseen epitopes and given the exclusive focus on CDR3β. The results should be of interest to the communities working on immunology and biological sequence data analysis.

  2. Reviewer #1 (Public review):

    Summary:

    The manuscript presents a deep learning framework for predicting T cell receptor (TCR) binding to antigens (peptide-MHC) using a combination of data augmentation techniques to address class imbalance in experimental datasets, and introduces both peptide-specific and pan-specific models for TCR-MHC-I binding prediction. The authors leverage a large, curated dataset of experimentally validated TCR-MHC-I pairs and apply a data augmentation strategy based on generative modeling to generate new TCR sequences. The approach is evaluated on benchmark datasets, and the resulting models demonstrate improved accuracy and robustness.

    Strengths:

    The most significant contribution of the manuscript lies in its data augmentation approach to mitigate class imbalance, particularly for rare but immunologically relevant epitope classes. The authors employ a generative strategy based on two deep learning architectures:

    (1) a Restricted Boltzmann Machine (RBM) and

    (2) a BERT-based language model, which is used to generate new CDR3β sequences of TCRs that serve as synthetic training data for rebalancing the classes of TCR-pMHC binding pairs.

    The distinction between peptide-specific (HLA allele-specific) and pan-specific (generalized across HLA alleles) models is well-motivated and addresses a key challenge in immunogenomics: balancing specificity and generalizability. The peptide-specific models show strong performance on known HLA alleles, which is expected, but the pan-specific model's ability to generalize across diverse HLA types, especially those not represented in training, is critical.

    Weaknesses:

    The paper would benefit from a more rigorous analysis of the biological validity of the augmented data. Specifically, how do the synthetic CDR3β sequences compare to real CDR3β sequences in terms of sequence similarity and motif conservation? The authors should provide a quantitative assessment of real vs. augmented sequences (e.g., via t-SNE or UMAP projections), or measure the overlap in known motif positions before and after augmentation. Without such validation, the risk of introducing "hallucinated" sequences that distort model learning remains a concern. Moreover, it would strengthen the argument if the authors demonstrated that performance gains are not merely due to overfitting on synthetic data, but reflect genuine generalization to unseen real data. Ultimately, this can only be performed through elaborate experimental wet-lab validation experiments, which may be outside the scope of this study.

    While generative modeling for sequence data is increasingly common, the choice of the RBM, a comparatively old architecture, could benefit from stronger justification, especially given the emergence of more powerful and scalable alternatives (e.g., ProGen, ESM, or diffusion-based models). While BERT was used, it will be valuable in the future to explore other architectures for data augmentation.

    The manuscript would be more compelling if the authors performed a deeper analysis of the pan-specific model's behavior across HLA supertypes and allele groups. Are the learned representations truly "pan" or merely a weighted average of the most common alleles? The authors should assess whether the pan-specific model learns shared binding motifs (anchor residue preferences) and whether these features are interpretable through attention maps. A failure to identify such patterns would raise concerns about the model's interpretability and biological relevance.

    The exclusive focus on CDR3β for TCR modeling is biologically problematic. TCRs are heterodimers composed of α and β chains, and the CDR1, CDR2, and CDR3 regions of both chains contribute to antigen recognition. The CDR3β loop is often more diverse and critical, but CDR3α and the CDR1/2 loops also play significant roles in binding affinity and specificity. By generating only CDR3β sequences and not modeling the full TCR αβ heterodimer, the authors risk introducing a systematic bias toward β-chain-dominated recognition, which will not reflect the full complexity of TCR-peptide-MHC interactions.

  3. Reviewer #2 (Public review):

    Summary:

    This paper presents a thoughtful and well-motivated strategy to address a major challenge in TCR-epitope binding prediction: data imbalance, particularly the scarcity of positive (binding) TCR-peptide pairs. The authors introduce a two-step pipeline combining data balancing, via undersampling and generative augmentation, with a supervised CNN-based classifier. Notably, the use of Restricted Boltzmann Machines (RBMs) and BERT-style transformer models to generate synthetic CDR3β sequences is shown to improve model performance. The proposed method is applied to both peptide-specific and pan-specific settings, yielding notable performance improvements, especially for in-distribution peptides. Generative augmentation also leads to measurable gains for out-of-distribution epitopes, particularly those with high sequence similarity to the training set.
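The two-step rebalancing described in this summary can be sketched in a few lines. This is an illustrative reconstruction, not the authors' implementation: `toy_generator` is a deliberately crude stand-in for sampling from a trained RBM or BERT model, and all names are ours.

```python
import random

AMINO = "ACDEFGHIKLMNPQRSTVWY"

def rebalance(positives, negatives, generate_synthetic, seed=0):
    """Two-step rebalancing sketch: (1) undersample negatives down to a cap,
    (2) top up positives with generated sequences until the classes match.
    `generate_synthetic(seed_seqs, n)` stands in for sampling from a
    generative model (RBM or BERT) trained on the real positives."""
    rng = random.Random(seed)
    cap = 2 * len(positives)  # keep at most 2x as many negatives as positives
    negs = rng.sample(negatives, min(cap, len(negatives)))
    n_missing = len(negs) - len(positives)
    synth = generate_synthetic(positives, n_missing) if n_missing > 0 else []
    return positives + synth, negs

def toy_generator(seed_seqs, n, seed=0):
    """Crude stand-in generator: copy a real positive, mutate one residue."""
    rng = random.Random(seed)
    out = []
    for _ in range(n):
        s = list(rng.choice(seed_seqs))
        i = rng.randrange(len(s))
        s[i] = rng.choice(AMINO)
        out.append("".join(s))
    return out

pos = ["CASSLGTDTQYF", "CASSPGQGYEQYF", "CASSLAGYNEQFF"]
neg = ["CASRDRGNTIYF"] * 10
bal_pos, bal_neg = rebalance(pos, neg, toy_generator)
print(len(bal_pos), len(bal_neg))  # classes are now balanced: 6 6
```

The key point of the design, as the review notes, is that the test set contains only real sequences, so any gain from the synthetic positives must reflect generalization rather than fitting to generation artifacts.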

    Strengths:

    (1) The authors tackle the well-known but under-addressed issue of class imbalance in TCR-epitope binding data, where negatives vastly outnumber positive (binding) pairs. This imbalance undermines classifier reliability and generalization.

    (2) The model is tested on both in-distribution (seen epitopes) and out-of-distribution (unseen epitopes) scenarios. Including a synthetic lattice protein benchmark allows the authors to dissect generalization behavior in a controlled environment.

    (3) The paper shows a measurable benefit of generative augmentation. For example, AUC improvements of up to +0.11 are observed for peptides closely related to those seen during training, demonstrating the method's practical impact.

    (4) A direct comparison between RBM- and Transformer-based sequence generators adds value, offering the community guidance on trade-offs between different generative architectures in TCR modeling applications.

    Weaknesses:

    (1) Generalization degrades with epitope dissimilarity

    The performance drops substantially as the test epitope becomes more dissimilar to the training set. This is expected, but it highlights a fundamental limitation of the generative models: they help only when the test epitope is similar to one already seen. Table 1 shows that the performance gain from generative augmentation decreases as the test epitope becomes more dissimilar to the training epitopes. For epitopes at Levenshtein distance 1 from the training set, the average AUC improvement is approximately +0.11; this gain drops to around +0.06 at distance 2 and becomes minimal at distance 4, indicating a clear limit on the model's ability to generalize to more distant epitopes. The authors should quantify more explicitly how far the model can generalize effectively. What is the performance degradation threshold as a function of Levenshtein distance?
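As a concrete notion of the dissimilarity discussed here, the distance of a test epitope to its closest training epitope can be computed with the standard Levenshtein dynamic program. This is an illustrative sketch under our own naming, not the authors' code:

```python
def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def distance_to_train(test_epitope, train_epitopes):
    """Distance of a test epitope to its closest training epitope,
    the binning variable behind the AUC gains quoted above."""
    return min(levenshtein(test_epitope, e) for e in train_epitopes)

train = ["GILGFVFTL", "NLVPMVATV"]
print(distance_to_train("GILGFVFTI", train))  # 1: one substitution away
```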

    (2) What is the minimal number of positive samples needed for data augmentation to help?

    The approach has an intrinsic catch-22: generative models require data to learn the underlying distribution and cannot be applied to epitopes with insufficient data. As a result, the method is unlikely to be effective for completely new epitopes. Could the authors quantify the minimum number of real binders needed for effective generative augmentation? This would be particularly relevant for zero-shot or few-shot prediction scenarios, where only 0-10 positive samples are available. Such experiments would help clarify the practical limits of the proposed strategy.

    (3) Lack of end-to-end evaluation on unseen epitopes as inputs

    The authors frame peptide-specific models as classification over a few known epitopes, a closed-set formulation. While this is useful for evaluating generation effects, it's not representative of the more practical open-set task of predicting binding to truly novel epitopes. A stronger test would include models that take peptides as input (e.g., pan-specific, peptide-conditioned classifiers), including unseen epitopes at test time. Could the authors attempt an evaluation on benchmarks like IMMREP25 or other datasets where test epitopes are excluded from training?

    (4) Focus on β-chain limits generalizability

    The current pipeline is trained exclusively on CDR3β sequences. However, the field is increasingly moving toward single-cell sequencing, which provides paired α/β TCR chain data. Understanding how the proposed approach performs when both chains are available would be valuable. Could the authors evaluate the performance gains on paired α/β information, even in a small subset of single-cell data?

    (5) Synthetic lattice proteins (LPs) have limited biological fidelity

    While the LP-based benchmark presented in Figure 5 is a clever and controlled tool for probing model generalization, it remains conceptually and biophysically distant from real TCR-peptide interactions. Its utility as a toy model is valid, but its limitations should be more explicitly acknowledged:

    a) Over-simplified binding landscape: The LP system is designed for tractability, with a simplified sequence-structure mapping and fixed lattice constraints. As shown in Figure 5c, the LP binding landscape is linearly separable, in stark contrast to the complex and often degenerate nature of real TCR-epitope interactions, where multiple structurally distinct TCRs can bind the same peptide and vice versa.

    b) Absence of immunological context: The LP model abstracts away key biological factors such as MHC restriction, α/β chain pairing, peptide presentation, and structural constraints of the TCR-pMHC complex. These are essential for understanding binding specificity in actual immune repertoires.

    c) Overestimation of generalization: While performance drops on more distant LP structures, even these are structurally and statistically more similar to the training data than truly novel biological epitopes. Thus, the LP benchmark likely underestimates the true difficulty of out-of-distribution generalization in real-world TCR prediction tasks.

    d) Simplified biophysics: The LP simulations rely on coarse-grained energy models and empirical potentials that do not capture conformational dynamics, side-chain flexibility, or realistic binding energetics of TCR-peptide interfaces.

    In summary, while the LP benchmark helps isolate specific generalization behaviors and sanity-check model performance under controlled perturbations, its biological relevance is limited. The authors should explicitly frame these assumptions and limitations to prevent overinterpreting results from this synthetic system.

  4. Reviewer #3 (Public review):

    Summary:

    The authors present a method to address class imbalance in T cell receptor (TCR)-epitope binding datasets by generating synthetic positive binding examples using generative models, specifically BERT-based architectures and Restricted Boltzmann Machines (RBMs). They hypothesize that improving class balance can enhance model performance in predicting TCR-peptide binding.

    Strengths:

    (1) Interesting biological as well as technical topic.

    (2) Solid technical foundations.

    Weaknesses:

    (1) Fundamental Biological Oversight:

    While the computational strategy of augmenting positive samples via generative models is technically interesting, the manuscript falls short in addressing key biological considerations. Specifically, the authors simulate and evaluate only CDR3β-peptide binding interactions. However, antigen recognition by T cells involves both the α- and β-chains of the TCR. The omission of CDR3α undermines the biological realism and limits the generalizability of the findings.

    (2) Validation of Simulated Data:

    The central claim of the manuscript is that simulated positive examples improve predictive performance. However, there is no rigorous validation of the biological plausibility or realism of the generated TCR sequences. Without independent evaluation (e.g., testing whether synthetic TCR-peptide pairs are truly binding), it remains unclear whether the performance gains are biologically meaningful or merely reflect artifacts of the generation process.

    (3) Risk of Bias and Overfitting:

    Training and evaluating models with generated data introduces a risk of circularity and bias. The observed improvements may not reflect better generalization to real-world TCR-epitope interactions but could instead arise from overfitting to synthetic patterns. Additional testing on independent, biologically validated datasets would help clarify this point.

  5. Author response:

    We would like to thank the editors and reviewers for the time spent on our work, their fair assessments, and their constructive criticism. We plan to address their concerns in a future revision as follows, organized by topic.

    (1) Limitations of focusing on CDR3β only

    In its current state, our work tested the proposed pipeline of data augmentation for binding prediction on benchmark datasets limited to peptide+CDR3β sequence pairs only. As pointed out by all the reviewers, the TCR-peptide interaction is more complex and also involves other regions of the receptor (such as the CDR3α chain) as well as the MHC presenting the peptide. To investigate how the inclusion of additional information impacts results, we plan to apply our pipeline in a setting where the generative protocol is extended to generate paired α and β chains. The supervised classifier will then receive a concatenation of α+β chains as inputs. We will compare the performance of this classifier with the one using β chains only, and add this analysis to the revised manuscript.

    (2) Validation of generated sequences and interpretation of the features learned by the generative model

    The reliability of the generative model in augmenting the training set with biologically sensible sequences is a crucial assumption of our approach, and we agree with the reviewers raising this as a main concern. Before stating our strategy to improve the soundness of the method, let us first point out a few aspects already considered in the present manuscript:

    • The test set of the classifier is always composed of real sequences: in this way, an increase in performance due to data augmentation cannot be due to overfitting to synthetic, possibly unrealistic, sequences.

    • The generative protocol is initialized from real sequences, and used to generate sequences not too far from them. In this respect, it can be taken as a way to "regularize" the simplest data augmentation strategy, random oversampling (taking multiple copies of sequences at random to rebalance the data). This procedure avoids the "wildly hallucinated" sequences that unreliable models can produce. We will better quantify this statement (see below).

    • The training protocol is tailored to push the generative model towards learning binding features between peptide and CDR3β sequences (and not merely fitting their local statistics separately). For example, in the pan-specific setting, during training of the generative model on peptide+CDR3β sequences, the masked language modeling task is modified to force the model to recover the missing amino acid using only the other sequence context.
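The modified masking scheme described in the last bullet can be illustrated as follows. This is a schematic reconstruction under our own naming, not the authors' training code: only CDR3β positions are eligible for masking, so the peptide context always stays visible and must be used to recover the masked residues.

```python
import random

MASK = "<mask>"

def mask_cdr3_only(peptide, cdr3b, p_mask=0.15, seed=1):
    """Mask tokens only inside the CDR3β segment, so that recovering them
    forces the model to attend to the (always visible) peptide context.
    Returns the masked token list and a dict {position: true residue}."""
    rng = random.Random(seed)
    tokens = list(peptide) + ["<sep>"] + list(cdr3b)
    offset = len(peptide) + 1  # CDR3β starts right after the separator
    targets = {}
    for i in range(offset, len(tokens)):
        if rng.random() < p_mask:
            targets[i] = tokens[i]
            tokens[i] = MASK
    return tokens, targets

toks, tgt = mask_cdr3_only("GILGFVFTL", "CASSLGTDTQYF")
assert all(i >= 10 for i in tgt)  # no peptide position is ever masked
```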

    We will better stress these points in the revised manuscript. To further validate the generative protocol in the future revision, we will carry out additional sanity checks on the generated data to confirm that the synthetic sequences remain biologically plausible and comparable to real ones.

    (3) Assessment of the performance of the pan-specific protocol for out-of-distribution data

    To better clarify how the degradation in performance of a classifier tested on out-of-distribution data is impacted by the dissimilarity between test and training data distributions, we will improve the synthetic analysis currently reported in Table 1 by adding confidence intervals for accuracy, quantifying the distance thresholds within which the method works, and providing t-SNE embeddings of in- and out-of-distribution data.

    (4) Quantification of the threshold on the number of examples per class needed to train the generative model and obtain a performance increase

    In the paper, we adopted an operative common-sense threshold of at least 100 sequences per class in order to apply our data augmentation pipeline. We will quantify this effect by testing this threshold in the revised manuscript, in order to (i) delineate the limits of this two-step generative protocol in the low-data regime and (ii) assess whether the generative model falls back to a random oversampling strategy (due to strong overfitting) when few data are available for training.
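The operative threshold mentioned here amounts to a simple per-class filter; a minimal sketch follows, where only the 100-sequence cutoff comes from the text and the names are ours:

```python
from collections import Counter

MIN_PER_CLASS = 100  # operative threshold stated in the text

def augmentable_classes(pairs, min_count=MIN_PER_CLASS):
    """Keep only epitope classes with enough real binders to train a
    generative model; the rest would fall back to plain oversampling."""
    counts = Counter(peptide for peptide, _cdr3b in pairs)
    return {p for p, n in counts.items() if n >= min_count}

pairs = ([("GILGFVFTL", "seq%d" % i) for i in range(150)]
         + [("NLVPMVATV", "seq%d" % i) for i in range(30)])
print(augmentable_classes(pairs))  # {'GILGFVFTL'}
```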

    (5) Motivation for the use of RBMs

    While RBMs have known limitations, their use in our pipeline (alongside the more modern TCR-BERT, which we also test) is mainly motivated by the fact that they provide measurable performance increases with data augmentation despite their simple two-layer architecture. We stress that simpler generative (profile) models do not show this increase; see Appendix 3. In this respect, the RBM provides a minimal generative model that allows us to augment data successfully, and a lower bound on the performance increase achievable with more complex architectures trained on more data. We will make this point of view explicit in the text.

    (6) Clarification on the role of lattice proteins as an oversimplified toy model for protein interaction

    We agree with the points raised by Reviewer #2 on the limitations of lattice proteins as a model for protein interaction. Indeed, we use them merely as a toy model for phenomenology, a strategy whose validity the reviewer fairly acknowledged. We will spell out in the main text the drastic simplifications involved and the reasons why the comparison to real data should be taken with great care.