Machine Learning Dataset and Benchmark for Accurate T Cell Receptor-pHLA Binding Prediction
Abstract
A central challenge in immunology and therapeutic design is accurately predicting the diverse interactions between T cell receptors (TCRs) and peptide-HLA (pHLA) complexes. Existing machine learning tools are hindered by incomplete sequence data and biased non-binding examples. To overcome these limitations, we present Hi-TPH, a large-scale hierarchical dataset featuring an on-the-fly selection strategy for generating non-binding data. We further develop Hi-TPH-PLMs, a collection of Protein Language Models (PLMs) with varied architectures and scales, fine-tuned on Hi-TPH. These models achieve a 17.4% performance gain over state-of-the-art tools on an external wet-lab test set. Leveraging the hierarchical structure of Hi-TPH, detailed analyses dissect the contribution of different molecular components to binding prediction and reveal their synergistic interplay; for instance, the contribution of HLA to prediction depends on the presence of full TCR chains. Hi-TPH and Hi-TPH-PLMs are publicly released to support the development of more reliable tools for advanced immunoinformatics research and personalized immunotherapy.
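To make the idea of generating non-binding data on the fly concrete, the following is a minimal sketch of one common approach: re-pairing known TCRs with pHLAs they are not recorded to bind, resampled each time it is called (for example, once per training epoch). It is an illustration under stated assumptions and not necessarily the selection strategy used in Hi-TPH; the function name `sample_negatives`, the pair representation, and the example sequences are all hypothetical.

```python
import random

# Hypothetical positive (binding) pairs: (TCR sequence, pHLA identifier).
POSITIVES = [
    ("CASSLGQAYEQYF", "GILGFVFTL|HLA-A*02:01"),
    ("CASSIRSSYEQYF", "NLVPMVATV|HLA-A*02:01"),
    ("CASSPGTGGYEQYF", "ELAGIGILTV|HLA-A*02:01"),
]


def sample_negatives(positives, n_per_positive=1, seed=None):
    """Draw presumed non-binding pairs by re-pairing TCRs with pHLAs.

    Assumption: a (TCR, pHLA) pair absent from `positives` is treated as
    non-binding. This is a generic heuristic, not the Hi-TPH strategy itself.
    """
    rng = random.Random(seed)
    known = set(positives)
    tcrs = sorted({tcr for tcr, _ in positives})
    negatives = []
    for _, phla in positives:
        # Candidate TCRs with no recorded binding to this pHLA.
        candidates = [t for t in tcrs if (t, phla) not in known]
        k = min(n_per_positive, len(candidates))
        negatives.extend((t, phla) for t in rng.sample(candidates, k))
    return negatives


# Calling the sampler once per epoch yields a fresh set of negatives each time.
epoch_negatives = [sample_negatives(POSITIVES, seed=epoch) for epoch in range(3)]
```

Resampling per epoch, rather than fixing one negative set up front, keeps the model from memorizing a single arbitrary set of presumed non-binders.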