Combining genotypes and T cell receptor distributions to infer genetic loci determining V(D)J recombination probabilities

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    This study demonstrates that differences in areas outside the regions that encode the TCR genes can affect the properties of TCRs that get made. This paper will be of interest to a broad swathe of immunologists who study such variable lymphocyte receptors. It combines several large datasets in an extremely statistically rigorous analysis, producing results consistent with but substantially expanding upon the prior knowledge of the field.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 and Reviewer #2 agreed to share their names with the authors.)

This article has been Reviewed by the following groups

Abstract

Every T cell receptor (TCR) repertoire is shaped by a complex probabilistic tangle of genetically determined biases and immune exposures. T cells combine a random V(D)J recombination process with a selection process to generate highly diverse and functional TCRs. The extent to which an individual’s genetic background is associated with their resulting TCR repertoire diversity has yet to be fully explored. Using a previously published repertoire sequencing dataset paired with high-resolution genome-wide genotyping from a large human cohort, we infer specific genetic loci associated with V(D)J recombination probabilities using genome-wide association inference. We show that V(D)J gene usage profiles are associated with variation in the TCRB locus and, specifically for the functional TCR repertoire, variation in the major histocompatibility complex locus. Further, we identify specific variations in the genes encoding the Artemis protein and the TdT protein to be associated with biasing junctional nucleotide deletion and N-insertion, respectively. These results refine our understanding of genetically-determined TCR repertoire biases by confirming and extending previous studies on the genetic determinants of V(D)J gene usage and providing the first examples of trans genetic variants which are associated with modifying junctional diversity. Together, these insights lay the groundwork for further explorations into how immune responses vary between individuals.

Article activity feed

  1. Author Response:

    Reviewer #2:

    In this study, Russell et al. combined T cell receptor (TCR) repertoire sequencing data with SNP array genotype data to infer genetic polymorphisms which impact upon the process of TCR generation. Using these data, the authors looked for loci with polymorphisms which associate with different V(D)J recombination probabilities, i.e. sites in the genome which impact upon the chances of TCRs with different properties being produced when they change.

    Beyond the expected sites in the TCR and MHC loci, the authors observed strong associations with distant sites. One was with DCLRE1C, which encodes Artemis, the endonuclease responsible for cutting the TCR loci during recombination, while the second was DNTT, the site encoding the enzyme TdT, which is responsible for addition of nucleotides to cut V(D)J during recombination. To my knowledge, this is the first time that such SNP associations have been described, and yet they make perfect sense: DCLRE1C variations were associated with the amount of trimming V and J genes underwent during recombination, while DNTT polymorphisms were associated with the number of inserted nucleotides. The authors also report, after assigning donors an associated ancestry based on clustering of their genotype data, that certain inferred ancestries associate with different TCR repertoire properties. In this analysis 'Asian-associated' TCR repertoires had fewer non-templated nucleotide insertions, along with a corresponding greater incidence of the DNTT polymorphisms associated with differences in insertions, relative to other groups.

    Strengths:

    This manuscript is exceedingly well written. Both the TCR biology and the statistical considerations of the genetic analyses are extremely complex topics, mired in arcane terminology, which often end up somewhat impenetrable to non-expert readers. However both have been introduced and explained with admirable clarity throughout, including the caveats and implications of analyses that would not be intuitive to many readers not already expert in both fields.

    As best I can determine, the analyses themselves are also extremely rigorous, with each step carefully taken and justified in the text, involving numerous corrections at multiple scales (e.g. for TCR productivity, TCR gene usage, specific TRBD2 genotype, population substructure, and more). The major findings have also been validated in a completely separate cohort, using a different analysis pipeline. While the authors point out that such genome-wide association efforts looking at TCR gene expression have been undertaken before, the major innovation presented here lies in applying those data to investigating specific V(D)J recombination probabilities. Thus the findings are novel, and the conclusions well supported by the data.

    The data visualisations have all been plotted in a sensible and easily interpretable manner. The majority of the data themselves are already publicly available, having been published in prior studies. The TCRseq data for the validation cohort have been assigned a BioProject accession, which I presume will go live at the time of publication. The code is also appropriately hosted on GitHub, and is mostly commented and documented well enough to be repeatable.

    Thank you!

    Limitations:

    There are very few if any obvious technical limitations or weaknesses that I can see that are not intrinsic to the data themselves. While the authors do mention these limitations, I wonder if they should be devoted some more attention somewhere in the text of the manuscript; relatively few researchers are expert in both TCR biology and the technicalities of genome-wide association studies, so I think more explicit consideration of these issues would be helpful.

    We have expanded the section containing limitations of our approach within the discussion section. We hope this addition clarifies the intrinsic limitations of the data used here.

    In particular, I think the difficulty of studying these loci with standard techniques could be underlined, along with what implications that might have for this study. The highly repetitive nature of the TCR loci can certainly make any analysis looking at short sequences problematic, which has implications for both the TCRseq and genotyping aspects of this study. Combined with the fact that most studies focus on certain populations, polymorphisms in the TCR loci are very likely being relatively undersampled by the field (a hypothesis supported by the ongoing discovery of novel exonic polymorphisms in TCRseq data itself, e.g. as demonstrated in this pre-print by Omer et al., https://doi.org/10.1101/2021.05.17.444409). The consequences of SNP coverage in SNP arrays have already been considered for IgH (https://doi.org/10.1038/gene.2012.12): while this is an admittedly more polymorphic locus, most of the underlying causes of these issues also apply to the TCR loci. Similarly, while the authors do appropriately point out that issues with V(D)J gene assignment could introduce biases, it may be worth noting that the TCRseq technology used to produce their main dataset uses relatively short-read sequencing, which is unable to distinguish a substantial fraction of even known TCR gene- and allele-level diversity (see Fig. 1C of the Omer et al. pre-print). Thus there may be a whole dimension of TCR polymorphism that is not well captured by either platform.

    This is a great suggestion and we have added a section within the discussion to mention these limitations and their implications for both the SNP array and TCR repertoire sequencing data used here.

    Overall, I think this is an extremely considered and digestible study, which will be of great interest across and beyond the field. As the wider community comes to grips with how best to incorporate TCR and BCR polymorphisms into their analyses of the adaptive immune loci themselves (and how this might impact upon recombination, expression, and downstream immune functions) this serves as a timely reminder that we should not forget the polymorphisms elsewhere in the genome that might also be relevant.

    Thank you!

    Reviewer #3:

    In this manuscript, Russell et al. propose an inference method to link genetic variations with TCR repertoire feature variations, based on observations from previous studies showing similarities at various levels of the repertoire in monozygotic twins. To that end, they used a unique publicly available dataset, which combines TCRb immunosequencing data with whole-genome SNP data. The method is elegant and sheds light on the importance of combining different types of data to better understand the complexity of TCR repertoire generation and selection. However, unfortunately, while their discovery dataset provides some associations between SNPs and TCR repertoire features, they were almost unable to recapitulate the results with their validation dataset. The main reasons could be that the donor demographics are highly divergent between the two cohorts (81% Caucasian in the discovery vs. mainly Hispanic in the validation), that the immunosequencing data were generated using an RNA-based method for the validation cohort while the discovery dataset was obtained from gDNA templates, and finally that the SNP arrays were discordant between the two datasets. Nonetheless, the approach and the study deserve attention and might be improved by additional experiments or analyses and by providing additional information.

    Thank you for your review. We would like to emphasize that the validation results reported here are as good as one might expect given the small sample size of the validation cohort (94 individuals) and the discordance between the discovery and validation SNP sets. The overlap between the discovery cohort and the validation cohort SNP sets consisted of just two significant SNPs, one within the gene encoding the Artemis protein (DCLRE1C) and the other within the gene encoding the TdT protein (DNTT). This DCLRE1C SNP (rs12768894, c.728A>G) was strongly associated with the extent of V-gene and J-gene trimming in the discovery cohort, and we were able to successfully validate this finding within the validation cohort. Specifically, this DCLRE1C SNP was significantly associated with the extent of J-gene trimming in productive TCRalpha and TCRbeta chains and V-gene trimming of both productive and non-productive TCRalpha and TCRbeta chains within the validation cohort. The overlapping SNP within the DNTT locus (rs3762093) was only weakly associated with the extent of N-insertion within the discovery cohort, and as such, it was not surprising that this SNP only reached statistical significance for one of the N-insertion types (productive TCRalpha rearrangements; note that due to the lack of the D gene, N-insertion annotations are likely less noisy on the TCRalpha locus). Despite our inability to replicate all N-insertion associations, we noted that the model coefficients for rs3762093 genotype were in the same direction (i.e., the minor allele was associated with fewer N-insertions) for all N-insertion and productivity types within the TCRbeta chains for both cohorts.
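
    The direction-of-effect comparison described in this response can be illustrated with a minimal sketch. It assumes a simple additive coding of the minor allele (0, 1, or 2 copies) and an ordinary least-squares fit with NumPy. The function names, the simulated data, and the effect size are hypothetical and are not taken from the authors' pipeline, which conditions on further covariates; only the validation cohort size (94) comes from the text.

```python
import numpy as np


def genotype_slope(dosage, feature):
    """OLS slope of a repertoire feature on minor-allele dosage (0, 1, 2)."""
    dosage = np.asarray(dosage, dtype=float)
    feature = np.asarray(feature, dtype=float)
    # Design matrix: intercept column plus additive genotype dosage.
    design = np.column_stack([np.ones_like(dosage), dosage])
    coef, *_ = np.linalg.lstsq(design, feature, rcond=None)
    return float(coef[1])


def same_direction(slope_a, slope_b):
    """True when both slopes are non-zero and share a sign."""
    return slope_a * slope_b > 0


def simulate_cohort(rng, n_donors, effect, maf=0.3, noise_sd=0.3):
    """Hypothetical cohort: mean N-insertions fall by `effect` per minor allele."""
    dosage = rng.binomial(2, maf, size=n_donors)
    mean_insertions = 5.0 - effect * dosage + rng.normal(0.0, noise_sd, n_donors)
    return dosage, mean_insertions


# Simulated stand-ins for a large discovery cohort and a 94-donor validation cohort.
rng = np.random.default_rng(0)
discovery = simulate_cohort(rng, n_donors=600, effect=0.3)
validation = simulate_cohort(rng, n_donors=94, effect=0.3)

slope_discovery = genotype_slope(*discovery)
slope_validation = genotype_slope(*validation)
print("same direction:", same_direction(slope_discovery, slope_validation))
```

    A sign check of this kind says nothing about statistical significance; it only shows whether the estimated effect of the minor allele points the same way in both cohorts, which is the weaker form of replication the response describes for rs3762093.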

  2. Evaluation Summary:

    This study demonstrates that differences in areas outside the regions that encode the TCR genes can affect the properties of TCRs that get made. This paper will be of interest to a broad swathe of immunologists who study such variable lymphocyte receptors. It combines several large datasets in an extremely statistically rigorous analysis, producing results consistent with but substantially expanding upon the prior knowledge of the field.

  3. Reviewer #1 (Public Review):

    The authors address the influence of genetic variation, captured as genome-wide SNPs, on multiple parameters of the T cell receptor repertoire, including differential variable gene usage, base pair deletions (trimming) and base pair additions at the VDJ junctions.

    The authors uncover significant genetic associations between variable gene usage and multiple SNPs in the T cell receptor beta locus and the MHC. They develop a model which conditions on gene usage to identify SNPs within the gene encoding the Artemis protein which associate with deletions, and other SNPs in the TdT gene which associate with additions. The authors attempt validation in a second cohort, although because of the lack of SNP overlap the validation observed is very limited.

    The study contributes to our understanding of the genetic control of the T cell receptor repertoire and its diversity. However, the advance in understanding of the process of repertoire creation is quite modest. Genetic effects on variable gene expression have been seen before. Genetic effects on base pair deletion and addition are new.
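
    Why conditioning on gene usage matters can be shown with a small worked example. This is a sketch under stated assumptions, not the authors' model: it uses an ordinary least-squares fit with one indicator column per gene, and the gene labels ("geneA", "geneB"), baseline trimming values, and per-allele effect are all invented for illustration. In the toy data, carriers of the minor allele use a heavily trimmed gene more often, so a fit that ignores gene identity overstates the genotype effect on trimming.

```python
import numpy as np


def fit_genotype_slope(dosage, trimming, genes=None):
    """OLS slope of trimming on allele dosage, optionally conditioning on gene."""
    dosage = np.asarray(dosage, dtype=float)
    trimming = np.asarray(trimming, dtype=float)
    columns = [np.ones_like(dosage), dosage]
    if genes is not None:
        genes = np.asarray(genes)
        # One indicator column per gene, skipping the first (reference) level.
        for gene in np.unique(genes)[1:]:
            columns.append((genes == gene).astype(float))
    design = np.column_stack(columns)
    coef, *_ = np.linalg.lstsq(design, trimming, rcond=None)
    return float(coef[1])


# Hypothetical rearrangements: trimming = gene baseline + 1 nucleotide per allele.
baseline = {"geneA": 2.0, "geneB": 6.0}
dosage = [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2]
genes = ["geneA", "geneA", "geneA", "geneB",
         "geneA", "geneA", "geneB", "geneB",
         "geneA", "geneB", "geneB", "geneB"]
trimming = [baseline[g] + 1.0 * d for g, d in zip(genes, dosage)]

naive = fit_genotype_slope(dosage, trimming)
conditioned = fit_genotype_slope(dosage, trimming, genes)
print(f"naive slope: {naive:.2f}, conditioned slope: {conditioned:.2f}")
```

    In this noise-free toy data the conditioned fit recovers the simulated per-allele effect of 1, while the unconditioned fit gives 2 because it absorbs the shift in gene usage.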

  4. Reviewer #2 (Public Review):

    In this study, Russell et al. combined T cell receptor (TCR) repertoire sequencing data with SNP array genotype data to infer genetic polymorphisms which impact upon the process of TCR generation. Using these data, the authors looked for loci with polymorphisms which associate with different V(D)J recombination probabilities, i.e. sites in the genome which impact upon the chances of TCRs with different properties being produced when they change.

    Beyond the expected sites in the TCR and MHC loci, the authors observed strong associations with distant sites. One was with DCLRE1C, which encodes Artemis, the endonuclease responsible for cutting the TCR loci during recombination, while the second was DNTT, the site encoding the enzyme TdT, which is responsible for addition of nucleotides to cut V(D)J during recombination. To my knowledge, this is the first time that such SNP associations have been described, and yet they make perfect sense: DCLRE1C variations were associated with the amount of trimming V and J genes underwent during recombination, while DNTT polymorphisms were associated with the number of inserted nucleotides. The authors also report, after assigning donors an associated ancestry based on clustering of their genotype data, that certain inferred ancestries associate with different TCR repertoire properties. In this analysis 'Asian-associated' TCR repertoires had fewer non-templated nucleotide insertions, along with a corresponding greater incidence of the DNTT polymorphisms associated with differences in insertions, relative to other groups.

    Strengths:

    This manuscript is exceedingly well written. Both the TCR biology and the statistical considerations of the genetic analyses are extremely complex topics, mired in arcane terminology, which often end up somewhat impenetrable to non-expert readers. However both have been introduced and explained with admirable clarity throughout, including the caveats and implications of analyses that would not be intuitive to many readers not already expert in both fields.

    As best I can determine, the analyses themselves are also extremely rigorous, with each step carefully taken and justified in the text, involving numerous corrections at multiple scales (e.g. for TCR productivity, TCR gene usage, specific TRBD2 genotype, population substructure, and more). The major findings have also been validated in a completely separate cohort, using a different analysis pipeline. While the authors point out that such genome-wide association efforts looking at TCR gene expression have been undertaken before, the major innovation presented here lies in applying those data to investigating specific V(D)J recombination probabilities. Thus the findings are novel, and the conclusions well supported by the data.

    The data visualisations have all been plotted in a sensible and easily interpretable manner. The majority of the data themselves are already publicly available, having been published in prior studies. The TCRseq data for the validation cohort have been assigned a BioProject accession, which I presume will go live at the time of publication. The code is also appropriately hosted on GitHub, and is mostly commented and documented well enough to be repeatable.

    Limitations:

    There are very few if any obvious technical limitations or weaknesses that I can see that are not intrinsic to the data themselves. While the authors do mention these limitations, I wonder if they should be devoted some more attention somewhere in the text of the manuscript; relatively few researchers are expert in both TCR biology and the technicalities of genome-wide association studies, so I think more explicit consideration of these issues would be helpful.

    In particular, I think the difficulty of studying these loci with standard techniques could be underlined, along with what implications that might have for this study. The highly repetitive nature of the TCR loci can certainly make any analysis looking at short sequences problematic, which has implications for both the TCRseq and genotyping aspects of this study. Combined with the fact that most studies focus on certain populations, polymorphisms in the TCR loci are very likely being relatively undersampled by the field (a hypothesis supported by the ongoing discovery of novel exonic polymorphisms in TCRseq data itself, e.g. as demonstrated in this pre-print by Omer et al., https://doi.org/10.1101/2021.05.17.444409). The consequences of SNP coverage in SNP arrays have already been considered for IgH (https://doi.org/10.1038/gene.2012.12): while this is an admittedly more polymorphic locus, most of the underlying causes of these issues also apply to the TCR loci. Similarly, while the authors do appropriately point out that issues with V(D)J gene assignment could introduce biases, it may be worth noting that the TCRseq technology used to produce their main dataset uses relatively short-read sequencing, which is unable to distinguish a substantial fraction of even known TCR gene- and allele-level diversity (see Fig. 1C of the Omer et al. pre-print). Thus there may be a whole dimension of TCR polymorphism that is not well captured by either platform.

    Overall, I think this is an extremely considered and digestible study, which will be of great interest across and beyond the field. As the wider community comes to grips with how best to incorporate TCR and BCR polymorphisms into their analyses of the adaptive immune loci themselves (and how this might impact upon recombination, expression, and downstream immune functions) this serves as a timely reminder that we should not forget the polymorphisms elsewhere in the genome that might also be relevant.

  5. Reviewer #3 (Public Review):

    In this manuscript, Russell et al. propose an inference method to link genetic variations with TCR repertoire feature variations, based on observations from previous studies showing similarities at various levels of the repertoire in monozygotic twins. To that end, they used a unique publicly available dataset, which combines TCRb immunosequencing data with whole-genome SNP data. The method is elegant and sheds light on the importance of combining different types of data to better understand the complexity of TCR repertoire generation and selection. However, unfortunately, while their discovery dataset provides some associations between SNPs and TCR repertoire features, they were almost unable to recapitulate the results with their validation dataset. The main reasons could be that the donor demographics are highly divergent between the two cohorts (81% Caucasian in the discovery vs. mainly Hispanic in the validation), that the immunosequencing data were generated using an RNA-based method for the validation cohort while the discovery dataset was obtained from gDNA templates, and finally that the SNP arrays were discordant between the two datasets. Nonetheless, the approach and the study deserve attention and might be improved by additional experiments or analyses and by providing additional information.