Revealing the hidden sequence distribution of epitope-specific TCR repertoires and its influence on machine learning model performance
Abstract
Numerous efforts have been made to decipher the epitope-T cell receptor (TCR) recognition code. Both simple machine learning techniques and deep learning strategies have been used to train models to predict epitope binding by TCR sequences. A good training data set is the foundation of every accurate prediction model, yet little attention has been given to the composition of these data sets. In this paper, we studied the natural distribution of TCR sequences within epitope-specific TCR repertoires, i.e., sets of TCRs binding the same epitope, and its impact on the predictability of TCR-epitope interactions. We found that the observed diversity of these repertoires can result from a smaller set of core binding motifs constrained by TCR generation. Moreover, we found a clear relationship between the sequence distribution of the training data and performance metrics, emphasizing the importance of the ground-truth data used when applying machine learning models in this domain. Taken together, these findings inform data set composition to help push epitope-TCR prediction models to the next level.