Assessing data size requirements for training generalizable sequence-based TCR specificity models via pan-allelic MHC-I non-self ligandome evaluation
Abstract
Quickly identifying which T cell receptors (TCRs) specifically bind patient-unique neoepitopes is a critical challenge for personalized TCR cell therapy in oncology. Due to the enormous diversity of both the TCR and neoepitope repertoires, a machine learning predictor of TCR-pMHC specificity for personalized therapy must generalize to TCRs and epitopes not seen in the training data. Here, for the first time, we estimate the necessary size of such training data. We first show that published models fail to generalize beyond a single-residue dissimilarity from the epitopes in the training set. We then impute the possible mutated ligandome across the 34 most prevalent human MHC alleles and represent it as a graph based on our established dissimilarity cutoff. By finding the dominating set of this graph, we estimate that between one million and 100 million epitopes are required to train a generalizable sequence-based TCR specificity prediction model, roughly 1000 times the size of current public data.

*Antoine Delaunay & Miles McGibbon contributed equally to this work.
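The graph-and-dominating-set idea in the abstract can be illustrated with a minimal sketch: nodes are candidate epitopes, edges connect epitopes within the generalization radius (here assumed to be a single-residue, i.e. Hamming distance 1, dissimilarity cutoff), and the size of a dominating set gives the number of epitopes that would need to be sampled for training so that every epitope lies within the cutoff of some training example. The sequences, the cutoff value, and the use of NetworkX's greedy dominating-set heuristic are illustrative assumptions, not the authors' actual pipeline, which must scale to millions of imputed ligands.

```python
import itertools
import networkx as nx


def hamming(a: str, b: str) -> int:
    """Number of differing residues between two equal-length peptides."""
    return sum(x != y for x, y in zip(a, b))


def training_epitope_cover(epitopes, cutoff=1):
    """Link epitopes within `cutoff` substitutions and return a (greedy)
    dominating set: every epitope is within `cutoff` of a chosen one.

    Note: minimum dominating set is NP-hard; nx.dominating_set is a
    heuristic, so this gives an upper-bound estimate, not the optimum.
    The O(n^2) pairwise comparison is only viable for toy inputs.
    """
    g = nx.Graph()
    g.add_nodes_from(epitopes)
    for a, b in itertools.combinations(epitopes, 2):
        if len(a) == len(b) and hamming(a, b) <= cutoff:
            g.add_edge(a, b)
    return nx.dominating_set(g)


# Toy 9-mer ligandome (illustrative sequences only).
ligandome = ["SIINFEKLA", "SIINFEKLV", "GILGFVFTL", "GILGFVFTI", "NLVPMVATV"]
cover = training_epitope_cover(ligandome, cutoff=1)
print(f"{len(cover)} of {len(ligandome)} epitopes needed to cover the set")
```

Applied to the imputed pan-allelic ligandome, the size of such a cover is what drives the one-to-100-million epitope estimate quoted above.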