TransHLA: A Hybrid Transformer Model for HLA-Presented Epitope Detection


Abstract

Background

Precise prediction of epitope presentation on human leukocyte antigen (HLA) molecules is crucial for advancing vaccine development and immunotherapy. Conventional HLA-peptide binding affinity prediction tools often focus on specific alleles and lack a universal approach for comprehensive HLA site analysis. This limitation hinders efficient filtering of invalid peptide segments.

Results

We introduce TransHLA, a pioneering tool designed for epitope prediction across all HLA alleles, integrating Transformer and Residue CNN architectures. TransHLA utilizes the ESM2 large language model for sequence and structure embeddings, achieving high predictive accuracy. For HLA class I, it reaches an accuracy of 84.72% and an AUC of 91.95% on IEDB test data. For HLA class II, it achieves 79.94% accuracy and an AUC of 88.14%. Our case studies using datasets like CEDAR and VDJdb demonstrate that TransHLA surpasses existing models in specificity and sensitivity for identifying immunogenic epitopes and neoepitopes.

Conclusions

TransHLA significantly enhances vaccine design and immunotherapy by efficiently identifying broadly reactive peptides. Our resources, including data and code, are publicly accessible at https://github.com/SkywalkerLuke/TransHLA

Key Points

  • We developed TransHLA, a deep learning tool for predicting epitopes across all HLA alleles using Transformer and Residue CNN architectures.

  • The model uses ESM2 embeddings to improve predictive accuracy and efficiency.

  • TransHLA shows superior specificity and sensitivity in identifying immunogenic epitopes and neoepitopes compared to existing models.

  • Our approach offers potential advancements in vaccine design and immunotherapy through enhanced peptide analysis.

Article activity feed

  1. TransHLA: A Hybrid Transformer Model for HLA-Presented Epitope Detection

    This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf008), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Markus Müller

    The authors present TransHLA, a deep learning tool to predict whether a peptide is an HLA binder or not. They use the ESM2 language model to create peptide embeddings for structural and sequence features and then use transformers and CNNs for the binding prediction. The article is well-written and clear. However, the authors must better justify the choice of their model and its potential application.

    Major comments:

    1. In personalized medicine, a patient's HLA alleles can be obtained via whole-exome sequencing (WES), so there is no need for such an HLA-agnostic binding predictor. Could you briefly outline the most important medical applications where your TransHLA predictor would be most useful?

    2. Could you give more information about your IEDB training set? What are the frequencies of the HLA alleles, and the number of peptides per allele? How did you perform the splits into training, validation, and test sets? Were peptides from the same allele all present in all 3 sets? How does TransHLA perform for peptides binding to alleles not present in the training set compared to peptides binding to alleles present in the training set? How does the performance depend on the number of peptides of the allele in the training set? Is the model biased to these frequent alleles?

    3. Peptides pass through many processing steps before being presented on HLA molecules: cleavage in the proteasome, transport via TAP into the ER, trimming by ERAP aminopeptidases, and finally loading onto the HLA complex. Why not perform the study on extended peptide sequences that take into account several amino acids before and after the peptide termini? In this way, the other processing steps could also be captured, and it would be interesting to see whether this sequence extension improves prediction.

    4. Could you compare your approach with a 'simpler' approach, where you calculate all Biopython features (such as flexibility), optionally choose the n most informative ones by feature selection, and use a standard classifier such as logistic regression or XGBoost to predict HLA binding? This method has the advantage of directly indicating which features are most relevant.

    5. Please provide the results of the ablation study in a table in the main text, where you compare the ablated models to the base model.

    6. Could you briefly explain what the different terms in the TIM loss are and why they are important?

    7. Does the flexibility depend on the length of the peptides? Peptides longer than 10 often bulge out of the binding groove, and naively one would expect them to be less stiff than peptides of length 8 or 9.
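The allele-overlap concern in major comment 2 can be made concrete with a small sketch: an allele-grouped split that guarantees no allele appears in both training and test data, so performance on held-out alleles measures true generalization. The record layout and function name below are illustrative, not the authors' actual pipeline:

```python
import random
from collections import defaultdict

def allele_grouped_split(records, test_frac=0.2, seed=0):
    """Split (peptide, allele, label) records so that no allele appears
    in both train and test: a leakage-free 'unseen allele' evaluation."""
    by_allele = defaultdict(list)
    for rec in records:
        by_allele[rec[1]].append(rec)
    alleles = sorted(by_allele)
    rng = random.Random(seed)
    rng.shuffle(alleles)
    # Hold out a fraction of alleles (not peptides) for testing.
    n_test = max(1, int(len(alleles) * test_frac))
    test_alleles = set(alleles[:n_test])
    train = [r for a in alleles if a not in test_alleles for r in by_allele[a]]
    test = [r for a in test_alleles for r in by_allele[a]]
    return train, test
```

Comparing accuracy on such a split against a random peptide-level split would directly answer whether the model is biased toward alleles that are frequent in training.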
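The extended-peptide idea in major comment 3 amounts to slicing flanking context out of the source protein. A minimal sketch, assuming the peptide occurs in its source protein and using 'X' to pad past the termini (function name illustrative):

```python
def extend_peptide(protein, peptide, flank=3, pad="X"):
    """Return the peptide extended by `flank` residues of source-protein
    context on each side; positions beyond the termini are padded."""
    i = protein.find(peptide)
    if i < 0:
        raise ValueError("peptide not found in source protein")
    left = protein[max(0, i - flank):i].rjust(flank, pad)
    right = protein[i + len(peptide):i + len(peptide) + flank].ljust(flank, pad)
    return left + peptide + right
```

Feeding such extended sequences to the model would let it see the residues relevant to proteasomal cleavage and TAP transport, not just the final ligand.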
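The baseline suggested in major comment 4 could look like the following stdlib-only sketch: mean Kyte-Doolittle hydropathy and length as toy features with a hand-rolled logistic regression. In practice one would use the full Biopython feature set and scikit-learn or XGBoost; the feature choice here is illustrative only:

```python
import math

# Kyte-Doolittle hydropathy scale, used as one interpretable toy feature.
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}

def features(pep):
    """Toy feature vector: bias, mean hydropathy, peptide length."""
    return [1.0, sum(KD[a] for a in pep) / len(pep), float(len(pep))]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.1, epochs=500):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
            for j, xj in enumerate(xi):
                grad[j] += err * xj
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]
    return w

def predict(w, pep):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, features(pep))))
```

The learned weights directly show which features drive the prediction, which is exactly the interpretability advantage the comment points to.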
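Regarding major comment 6: a TIM-style objective typically combines a supervised cross-entropy term with a mutual-information term, the marginal entropy of the predictions minus the mean conditional entropy, so that predictions are individually confident yet balanced across classes. A numerical sketch of the three terms (the paper's modified TIM loss may weight them differently):

```python
import math

def entropy(p):
    """Shannon entropy (natural log) of a discrete distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def tim_terms(probs, labels):
    """probs: predicted class distributions; labels: true class ids.
    Returns (cross_entropy, conditional_entropy, marginal_entropy).
    A TIM-style loss minimizes CE and H(Y|X) while maximizing H(Y):
    confident per-sample predictions with a balanced class marginal."""
    n, k = len(probs), len(probs[0])
    ce = -sum(math.log(p[y]) for p, y in zip(probs, labels)) / n
    cond = sum(entropy(p) for p in probs) / n
    marginal = entropy([sum(p[j] for p in probs) / n for j in range(k)])
    return ce, cond, marginal
```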

    Minor:

    1. In Equation 10, please define p̂_k. In the text, T denotes the number of classes, but the formulae use K.
  2. TransHLA: A Hybrid Transformer Model for HLA-Presented Epitope Detection

    This work has been peer reviewed in GigaScience (https://doi.org/10.1093/gigascience/giaf008), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    **Reviewer 1: Georgios Fotakis**

    1. General Comments

    In this manuscript, the authors present TransHLA, a hybrid transformer model that integrates a transformer-based language model with a deep Convolutional Neural Network (CNN) module. The transformer encoder module leverages a pre-trained large language model (Evolutionary Scale Modeling - ESM2) to extract global features using a multi-head attention mechanism. The feature extraction is further enhanced by two consecutive CNN modules, maximizing the mutual information between query features (sequences) and their label predictions (epitope/non-epitope) through a modified Transductive Information Maximization (TIM) loss function. TransHLA is designed to collectively consider all HLA sites across all alleles and is the first neoantigen prediction tool of its kind, since it does not require HLA alleles as input. The authors also present benchmark study results, showcasing the increased predictive accuracy of TransHLA and its potential as a valuable pre-screening tool.

    The computational method presented in this manuscript demonstrates a strong scientific foundation and shows promise for future refinement and extension, suggesting significant potential for meaningful research output. However, there are some conceptual and technical concerns that need to be addressed.

    2. Specific comments for revision

    a) Major

    Manuscript:

    i) Introduction
    • The authors distinguish between two categories of models: those that need only epitopes as input and those that require both epitopes and HLA alleles as inputs. However, the basis for this classification is unclear. For instance, MHCNuggets and DeepSeqPanII, cited as examples of the first category, actually require both an allele and an epitope to predict neoantigens. This is supported by the algorithms' manuals and the supplementary material provided by the authors, where they specify the need for HLA alleles to execute the commands.

    • The authors state: "Considering that TransHLA is the first epitope prediction software that does not impose restrictions on HLA alleles" This needs clarification, as all available "pan-allele" models do not impose restrictions on HLA alleles (the models are trained on nearly all sequenced HLAs). Perhaps the authors meant that TransHLA does not require HLA alleles as input?

    ii) Results

    • The reason for conducting two separate benchmarks (case study and validation) with different HLA binding affinity predictors is unclear. For instance, it is not explained why netMHCpan/netMHCpanII were not included in the first benchmark and only used in the validation part.

    • It would be very informative if the authors were able to include other widely used HLA binding affinity predictors in their benchmarks, such as mixMHCpred and mixMHCpred2.

    • The authors state: "the details information of alleles used in each tool can be found in the Supplementary File" However, no information about the alleles used in this study is provided (or at least it was not made available to me at the time of reviewing this version of the manuscript).

    • The "protein structural flexibility" should be briefly explained and properly cited (Vihinen et al., 1994, Proteins, 19(2), 141-149).

    iii) Conclusion and Discussion

    • The authors claim that TransHLA alleviates "the restrictive requirement of knowing the specific HLA alleles." However, this is not typically a restriction: serological typing of HLA is routinely performed in clinics, and samples usually come with the relevant metadata. Additionally, HLA typing can be performed with high accuracy from RNAseq and/or WES data, the same data usually required to produce the putative epitopes in the first place (e.g., OptiType can reach 93.5% [CI95: 91.8-95.1%] accuracy for HLA class I). Therefore, this information is generally readily available for processing. While the authors effectively demonstrate the accuracy of TransHLA, they fail to clarify the context in which this computational tool could be utilized.

    • To the best of my knowledge, in the research field of personalized medicine, neoantigen vaccines are typically produced at the patient level, taking the patients' HLA alleles into consideration. Binding affinity, by definition, can quantitatively differentiate between strong (low IC50) and weak (high IC50) binders. Thus, binding affinity predictions are a pivotal step for neoantigen prioritization. Given that the authors suggest TransHLA as an "alternative for filtering potential epitopes", how would TransHLA perform in such situations? To enhance clarity, the authors should elaborate on a scenario where TransHLA would be a superior choice compared to high-performing HLA binding affinity predictors in this context.

    • The authors mention in the introduction that TransHLA can be used to "expedite the precise screening of peptides". Additionally, in their GitHub repository it is stated that TransHLA "can serve as a preliminary screening for the currently popular tools that are specific for HLA-epitope binding affinity", which is quite accurate. They might consider incorporating this concept into their concluding remarks as well.

    Implementation:

    • Since neoantigen prediction is typically carried out using computational pipelines, it would be very helpful if the authors could provide instructions for end-users to install the software and its dependencies in isolated (contained) computational environments. To enhance clarity, I am attaching the files I used to create these environments via Conda (transhla_env.yaml), Singularity (TransHLA.def), and Docker (Dockerfile).

    • Following the previous point, the authors should consider providing a CLI (similar to the "train.py" and "inference.py" scripts in their GitHub repository) to enhance the software's usability in computational pipelines. As an example, I am attaching the script I used to test the software (TransHLA.py).
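Such a CLI could be as thin as an argparse wrapper around the existing inference code; the flag names below are illustrative, not the authors' actual interface:

```python
import argparse

def build_parser():
    """Argument parser for a hypothetical TransHLA command-line wrapper
    (flag names are illustrative, not the tool's real interface)."""
    p = argparse.ArgumentParser(
        prog="transhla",
        description="Score candidate peptides for HLA presentation.")
    p.add_argument("--input", required=True,
                   help="input FASTA of candidate peptides")
    p.add_argument("--output", default="predictions.tsv",
                   help="TSV of per-peptide scores")
    p.add_argument("--hla-class", choices=["I", "II"], default="I",
                   help="which TransHLA model to run")
    p.add_argument("--threshold", type=float, default=0.5,
                   help="score cutoff for calling a peptide an epitope")
    return p

# Example: args = build_parser().parse_args(["--input", "peptides.fa"])
```

A fixed, scriptable interface like this (plus the containers suggested above) is what makes a tool drop-in usable inside Nextflow- or Snakemake-style pipelines.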

    b) Minor

    • It would enhance clarity (especially for readers who are not familiar with artificial intelligence) if the authors briefly explained each technical term before using its abbreviation, for example "Evolutionary Scale Modeling (ESM2)" and so on.

    • Additionally, the manuscript and its supplementary material contain several grammatical and spelling errors that need to be rectified.