Repertoire-level generation of T-cell epitopes with a large-scale generative transformer

Abstract

Single-cell TCR sequencing enables high-resolution analysis of T-cell receptor (TCR) diversity and clonality, offering valuable insights into immune responses and disease mechanisms. However, identifying cognate epitopes for individual TCRs requires complex and costly functional assays. We address this challenge with EpitopeGen, a large-scale transformer model based on the GPT-2 architecture that generates potential cognate epitope sequences directly from TCR sequences. To overcome the scarcity of known TCR-epitope binding pairs (≈100,000), EpitopeGen uses a semi-supervised learning method, termed BINDSEARCH, which searches over 70 billion potential pairs and incorporates high-binding-affinity predictions as pseudo-labels. To incorporate CD8+ T cell biology into the model as an inductive bias, EpitopeGen employs a novel data balancing method, termed Antigen Category Filter, that carefully controls antigen category ratios in its training dataset. EpitopeGen significantly outperforms baseline approaches, generating epitopes with high binding affinity, diversity, naturalness, and biophysical stability. Notably, the epitopes generated by EpitopeGen follow biologically plausible antigen category distributions, a crucial feature not achieved by other models. Using EpitopeGen, we directly identify subsets of clonally expanded tumor-infiltrating lymphocytes that recognize tumor-associated antigens and exhibit elevated cytotoxicity and reduced exhaustion markers. In COVID-19 patients, EpitopeGen detects T cells that recognize the SARS-CoV-2 spike protein and non-structural proteins, with distinct transcriptomic characteristics. In conclusion, EpitopeGen represents the first computational method that enables direct inference of the antigen recognition profiles of CD8+ T cells from plain TCR repertoires.