Assessing data size requirements for training generalizable sequence-based TCR specificity models via pan-allelic MHC-I non-self ligandome evaluation

Abstract

Rapidly identifying which T cell receptors (TCRs) specifically bind patient-unique neoepitopes is a critical challenge for personalized TCR cell therapy in oncology. Due to the enormous diversity of both the TCR and neoepitope repertoires, a machine learning predictor of TCR-pMHC specificity for personalized therapy must generalize to TCRs and epitopes not seen in the training data. Here we provide the first estimate of the training data size such a model requires. We first show that published models fail to generalize beyond a single-residue dissimilarity from the epitope distribution of their training sets. We then impute the possible mutated ligandome across the 34 most prevalent human MHC-I alleles and represent it as a graph whose edges follow our established dissimilarity cutoff. By finding a dominating set of this graph, we estimate that between one million and 100 million epitopes are required to train a generalizable sequence-based TCR specificity prediction model, roughly 1000 times the size of current public data.

*Antoine Delaunay & Miles McGibbon contributed equally to this work.
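The graph-and-dominating-set step can be illustrated with a short sketch. The Python below is a hypothetical, minimal reconstruction, not the authors' code: it uses Hamming distance as a stand-in for the paper's dissimilarity measure, a single-residue cutoff matching the generalization limit reported above, and a standard greedy approximation of the minimum dominating set. The toy ligandome and all names are illustrative.

```python
from itertools import combinations

def hamming(a: str, b: str) -> int:
    """Number of mismatched residues between two equal-length peptides."""
    return sum(x != y for x, y in zip(a, b))

def build_adjacency(epitopes, cutoff=1):
    """Connect epitopes whose dissimilarity is within the cutoff.

    Hamming distance over equal-length peptides is an assumption here,
    standing in for whatever dissimilarity measure the paper establishes.
    """
    adj = {e: {e} for e in epitopes}  # each node dominates itself
    for a, b in combinations(epitopes, 2):
        if len(a) == len(b) and hamming(a, b) <= cutoff:
            adj[a].add(b)
            adj[b].add(a)
    return adj

def greedy_dominating_set(adj):
    """Greedy approximation of a minimum dominating set: repeatedly
    pick the node that covers the most still-uncovered nodes."""
    uncovered = set(adj)
    dominating = []
    while uncovered:
        best = max(adj, key=lambda n: len(adj[n] & uncovered))
        dominating.append(best)
        uncovered -= adj[best]
    return dominating

# Toy ligandome of 9-mers; the real analysis spans the imputed
# mutated ligandome across 34 prevalent MHC-I alleles.
ligandome = ["SIINFEKLV", "SIINFEKLA", "GILGFVFTL", "GILGFVFTV", "NLVPMVATV"]
cover = greedy_dominating_set(build_adjacency(ligandome, cutoff=1))
print(f"{len(cover)} epitopes dominate a ligandome of {len(ligandome)}")
```

The size of the resulting set is the quantity of interest: every imputed epitope then lies within the dissimilarity cutoff of at least one training epitope, so the cover size bounds how many distinct epitopes a training set must contain for a model to stay within its demonstrated generalization radius. The greedy heuristic is a common choice because exact minimum dominating set is NP-hard.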
