Combining mutation and recombination statistics to infer clonal families in antibody repertoires

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This important study provides a new, apparently high-performance algorithm for B cell clonal family inference. The new algorithm is highly innovative and based on a rigorous probabilistic analysis of the relevant biological processes and their imprint on the resulting sequences; however, the strength of evidence regarding the algorithm's performance is incomplete, due to (1) a lack of clarity regarding how different data sets were used for different steps during algorithm development and validation, resulting in concerns of circularity, (2) a lack of detail regarding the settings for competitor programs during benchmarking, and (3) method development, data simulation for method validation, and empirical analyses all based on the B cell repertoire of a single subject. With clarity around these issues and application to a more diverse set of real samples, this paper could be fundamental to immunologists and important to any researcher or clinician utilizing B cell receptor repertoires in their field (e.g., cancer immunology).

Abstract

B-cell repertoires are characterized by a diverse set of receptors of distinct specificities generated through two processes of somatic diversification: V(D)J recombination and somatic hypermutations. B cell clonal families stem from the same V(D)J recombination event, but differ in their hypermutations. Clonal family identification is key to understanding B-cell repertoire function, evolution and dynamics. We present HILARy (High-precision Inference of Lineages in Antibody Repertoires), an efficient, fast and precise method to identify clonal families from high-throughput sequencing datasets. HILARy combines probabilistic models that capture the receptor generation and selection statistics with adapted clustering methods to achieve consistently high inference accuracy. It automatically leverages the phylogenetic signal of shared mutations in difficult repertoire subsets. Exploiting the high sensitivity of the method, we find the statistics of evolutionary properties such as the site frequency spectrum and dN/dS ratio do not depend on the junction length. We also identify a broad range of selection pressures spanning two orders of magnitude.

Article activity feed

  1. Author Response

    eLife assessment

    This important study provides a new, apparently high-performance algorithm for B cell clonal family inference. The new algorithm is highly innovative and based on a rigorous probabilistic analysis of the relevant biological processes and their imprint on the resulting sequences; however, the strength of evidence regarding the algorithm's performance is incomplete, due to (1) a lack of clarity regarding how different data sets were used for different steps during algorithm development and validation, resulting in concerns of circularity, (2) a lack of detail regarding the settings for competitor programs during benchmarking, and (3) method development, data simulation for method validation, and empirical analyses all based on the B cell repertoire of a single subject. With clarity around these issues and application to a more diverse set of real samples, this paper could be fundamental to immunologists and important to any researcher or clinician utilizing B cell receptor repertoires in their field (e.g., cancer immunology).

    We apologize for the long delay in implementing the suggested changes. Some of the co-authors had some personal issues that made it hard to efficiently work on the revision.

    We have addressed all the essential points below, as well as all the detailed comments of each reviewer in the following pages.

    Due to the journal’s guidelines we have to upload an “all black” version of the manuscript as the main version. We have uploaded a revised manuscript with the changes marked in red as a “Related Manuscript file”, which appears at the very end of the Merged Manuscript File, after all the Figures, and at the end of the list of files on the webpage. We apologize for this inconvenience.

    In addition, we have added an extension of HILARy to deal with paired-chain repertoires, and have benchmarked the new method on a recently published synthetic dataset. This new analysis is now presented in new Fig. 5.

    Reviewer #1 (Public Review):

    Identifying individual BCR/Ab chain sequences that are members of the same clone is a longstanding problem in the analysis of BCR/Ab repertoire sequencing data. The authors propose a new method designed to be scalable for application to huge repertoire data sets without sacrificing accuracy. Their approach utilizes Hamming Distance between CDR3 sequences followed by clustering for a fast, high-precision approach to classifying pairs of sequences as related or not, and then refines the classification using mutation information from germline-encoded regions. They compare their method with other state-of-the-art methods using synthetic data.

    The authors address an important problem in an interesting, innovative, and rigorous way, using probabilistic representations of CDR3 differences, frequencies of shared and not-shared mutations, and the relationships between the two under hypotheses of related pairs and unrelated pairs, and from these develop an approach for determining thresholds for classification and lineage assignment. Benchmarking shows that the proposed method, the complete method including both steps, outperforms other methods.

    Strengths of the method include its theoretical underpinnings which are consistent with an immunologist's intuition about how related and unrelated sequences would compare with each other in terms of the metrics to use and how those metrics are related to each other.

    I have two high-level concerns:

    (1) It isn't clear how the real and synthetic data are being used to estimate parameters for the classifier and evaluate the classifier to avoid circularity. It seems like the approach is used to assign lineages in the data from [1], and then properties of this set of lineages are used to estimate parameters that are then used to refine the approach and generate synthetic data that is used to evaluate the approach. This may not be a problem with the approach but rather with its presentation, but it isn't entirely clear what data is being used and where for what purpose. An understanding of this is necessary in order to truly evaluate the method and results.

    The reviewer is correct in their understanding of the pipeline. It should be stressed that the lineages used to guide the generation of the synthetic data were inferred on VJl classes for which the clustering was easy and reliable, and they should therefore be largely model-independent.

    We have added an explanation in the main text of why the re-use of real data lineages inferred by HILARy doesn’t bias the procedure, since it’s done on a subset of lineages within VJl classes that are easy to infer (section “Test on synthetic dataset”).

    (2) Regarding the data used for benchmarking - given the intertwined fashion by which the classification approach and synthetic data generation approach appear to have been developed, it is not surprising that the proposed approach outperforms the other methods when evaluated on the synthetic data presented here. It would be better to include in the benchmark the data used by the other methods to benchmark themselves or also generate synthetic data using their data generation procedures.

    We agree with the reviewer that a test of the method on an independent synthetic dataset is important for its applicability and to compare to other methods.

    We have added a new synthetic dataset from the group that designed the partis method to our benchmark. Our method still performs competitively, on par with partis—which was developed and tested on that dataset—and better than other methods. The results are presented in revised Fig. 4 (panels E-G), and Figure 4–figure supplement 1 as a function of the mutation rate.

    In addition, we have used that dataset to benchmark a new version of HILARy that also uses the light chain. We present the results in new Figure 5 and Figure 4–figure supplement 1.
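    For readers unfamiliar with how clonal partitions are usually scored in benchmarks of this kind, the following is a minimal sketch of pairwise precision and sensitivity. It is an illustration only: the exact metrics, and the function name `pairwise_scores`, are assumptions and are not taken from the manuscript or from the HILARy code.

```python
from itertools import combinations


def pairwise_scores(true_labels, inferred_labels):
    """Score an inferred partition against the truth, pair by pair.

    Precision: fraction of pairs placed together that truly belong together.
    Sensitivity: fraction of truly related pairs that were placed together.
    """
    if len(true_labels) != len(inferred_labels):
        raise ValueError("label lists must have the same length")

    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_inferred = inferred_labels[i] == inferred_labels[j]
        if same_true and same_inferred:
            tp += 1
        elif same_inferred:
            fp += 1  # merged two sequences from different families
        elif same_true:
            fn += 1  # split two sequences from the same family

    # Convention chosen here: an empty denominator counts as a perfect score.
    precision = tp / (tp + fp) if tp + fp else 1.0
    sensitivity = tp / (tp + fn) if tp + fn else 1.0
    return precision, sensitivity
```

    The loop visits every pair, so this form is only practical for small test sets; a large benchmark would need to count pairs from cluster sizes instead.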

    An improved method for BCR/Ab sequence lineage assignment would be a methodologic advancement that would enable more rigorous analyses of BCR/Ab repertoires across many fields, including infectious disease, cancer, autoimmune disease, etc., and in turn, enable advancement in our understanding of humoral immune responses. The methods would have utility to a broad community of researchers.

    Reviewer #2 (Public Review):

    This manuscript describes a new algorithm for clonal family inference based on V and J gene identity, sequence divergence in the CDR3 region, and shared mutations outside the CDR3. Specifically, the algorithm starts by grouping sequences that have the same V and J genes and the same CDR3 length. It then performs single-linkage clustering on these groups based on CDR3 Hamming distance, then further refines these groups based on shared mutations.
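    The first two steps described above (grouping by V gene, J gene and CDR3 length, then single-linkage clustering on CDR3 Hamming distance) can be sketched as follows. This is a naive illustration under stated assumptions, not the HILARy implementation: the function names, the record layout and the use of an absolute mismatch count as the threshold are all hypothetical, and the quadratic pair loop does not reflect the efficient clustering the reviewer mentions.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires equal-length sequences")
    return sum(x != y for x, y in zip(a, b))


def single_linkage(cdr3s, threshold):
    """Link any two sequences within `threshold` mismatches (union-find)."""
    parent = list(range(len(cdr3s)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(cdr3s)), 2):
        if hamming(cdr3s[i], cdr3s[j]) <= threshold:
            parent[find(i)] = find(j)
    return [find(i) for i in range(len(cdr3s))]


def cluster_repertoire(records, threshold):
    """records: list of (v_gene, j_gene, cdr3). Returns one label per record."""
    # Step 1: group sequences into classes sharing V gene, J gene, CDR3 length.
    classes = defaultdict(list)
    for idx, (v_gene, j_gene, cdr3) in enumerate(records):
        classes[(v_gene, j_gene, len(cdr3))].append(idx)

    # Step 2: single-linkage clustering inside each class.
    labels = [None] * len(records)
    for key, members in classes.items():
        roots = single_linkage([records[i][2] for i in members], threshold)
        for i, root in zip(members, roots):
            labels[i] = (key, root)
    return labels
```

    Single linkage chains: two sequences can end up in the same cluster without being within the threshold of each other, as long as a path of close neighbours connects them.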

    Although there are a number of algorithms that use a similar overall strategy, a couple of aspects make this work unique. First, a persistent challenge for algorithms such as this one is how to set a cutoff for single-linkage clustering: if it is too low, then one separates clusters that should be together, and if too high one joins together clusters that should be separate. Here the authors leverage a rich collection of probabilistic tools to make an optimal choice. Specifically, they model the probability distributions of within- and between-cluster CDR3 Hamming distances, with parameters depending on CDR3 length and the "prevalence" of clonal sequence pairs (i.e. family size distribution). This allows the algorithm to make optimal choices for separating clusters, given the particular chosen distance metric, and assuming the sample in question has been accurately modeled. Second, the algorithm uses a highly efficient means of doing single-linkage clustering on nucleotide sequences.
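    One way to read the threshold choice described above: if related and unrelated pairs have known distance distributions and a known prevalence, the expected precision of any cutoff follows directly, and the most permissive cutoff meeting a precision target can be picked. The sketch below is an assumption-laden illustration of that idea; the function name, the 0.99 target and the toy distributions in the usage note are hypothetical and are not taken from the manuscript.

```python
from itertools import accumulate


def choose_threshold(p_related, p_unrelated, prevalence, target_precision=0.99):
    """Largest distance cutoff whose expected pairwise precision meets the target.

    p_related[x], p_unrelated[x]: probability that a related / unrelated pair
    of CDR3s differs at exactly x positions.
    prevalence: fraction of all pairs that are truly related.
    Returns None if no cutoff meets the target.
    """
    cdf_related = accumulate(p_related)
    cdf_unrelated = accumulate(p_unrelated)

    best = None
    for cutoff, (f_rel, f_unrel) in enumerate(zip(cdf_related, cdf_unrelated)):
        true_pos = prevalence * f_rel            # related pairs under cutoff
        false_pos = (1.0 - prevalence) * f_unrel  # unrelated pairs under cutoff
        called = true_pos + false_pos
        # Convention chosen here: a cutoff that calls no pairs is fully precise.
        precision = true_pos / called if called > 0 else 1.0
        if precision >= target_precision:
            best = cutoff
    return best
```

    With toy inputs `p_related = [0.6, 0.3, 0.1, 0.0, 0.0]` and `p_unrelated = [0.0, 0.001, 0.009, 0.09, 0.9]`, lowering the prevalence from 0.1 to 0.01 tightens the selected cutoff, which is the dependence on prevalence that the review points to.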

    This leads to a fast and highly performant algorithm on data meant to replicate the original sample used in algorithm design. The ideas are new and beautifully developed. The application to real data is interesting, especially the point about dN/dS.

    However, the paper leaves open the question of how this inference algorithm works on samples other than the one used for simulation and as a template for validation. If I understand the simulation procedure correctly - that one takes a collection of inferred trees from the real data, then re-draws the root sequence and the identity of the mutations on the branches - then the simulated data should be very close to the data used to develop the methods in the paper. This consideration seems especially important given that key methods in this paper use mutation counts and overall mutation counts are preserved.

    Repertoires come in all shapes and sizes: infants to adults, healthy to cancerous, and naive to memory to plasma-cell-just-after-vaccination. If this is being proposed as a general-purpose clonal inference algorithm rather than one just for this sample, then a more diverse set of validations is needed.

    We agree that testing the method on a differently generated dataset is a useful check. We should point out, however, that our synthetic dataset is not as biased as it may seem. In particular, it is based on trees from VJl classes that we predicted to be very easy to cluster, which means that they are faithful to the data and do not depend on the particular algorithm used to infer them. The big advantage of this synthetic dataset over others is that it recapitulates the power-law statistics of the clone size distribution, as well as the diversity of mutation rates. To us, it still represents a more useful benchmark than synthetic datasets generated by population genetics models, which miss most of this very broad variability.

    However, to check how the method generalizes to other datasets, we repeated our validation procedure on the dataset used to evaluate partis in Ralph et al. (2022). The new results are discussed in the main text and in new panels of Fig. 4 in the same form as the previous comparisons. We also added a comparison of performance as a function of mutation rate in the new Figure 4–figure supplement 1.

    It is unclear how to run the code. The software repo has a nice readme explaining the file layout, dependencies, and input file format, but the repo seems to be lacking an inference.ipynb mentioned there which runs an analysis. Perhaps this is a typo and refers to inference.py, which in addition to the documented cdr3 clustering, seems to have functions to run both clustering methods. However, it does not seem to have any documentation or help messages about how to run these functions.

    We have completely overhauled the GitHub repository to provide a detailed, step-by-step explanation of how to run the code. The code is now easily installable using pip.

    The results are not currently reproducible, because the simulated data is not available. The data availability statement says that no data have been generated for this manuscript; however, simulated data has been generated, and that is a key aspect of the analysis in the paper.

    We have uploaded the simulated data to Zenodo, and have provided scripts in the GitHub repository to run the benchmarks.

    More detail is needed to understand the timing comparisons. The new software is clearly written to use many threads. Were the other software packages run using multiple threads? What type of machine was used for the benchmarks?

    All timing comparisons were made on a single VJl class, on a computer with 14 double-threaded CPUs. HILARy uses all 28 threads, and the other methods were run with their default settings, with multi-threading allowed.

    We have clarified the specifications of the computer.

    Reviewer #3 (Public Review):

    B cell receptors are produced through a combination of random V(D)J recombination and somatic hypermutation. Identifying clonal lineages - cells that descend from a common V(D)J rearrangement - is an important part of B cell repertoire analysis. Here, the authors developed a new method to identify clonal lineages from BCR data. This method builds off of prior advances in the field and uses both an adaptive clonal distance threshold and shared somatic hypermutation information to group B cells into clonal lineages.

    The major strength of this paper is its thorough quantitative treatment of the subject and integration of multiple improvements into the clonal clustering process. By their simulation results, the method is both highly efficient and accurate.

    The only notable weakness we identified is that much of the impact of the method will depend on its superiority to existing approaches, and this is not convincingly demonstrated by Fig. 4. In particular, little detail is given on how the other clonal clustering programs were run, and this can significantly impact their performance. More specifically:

    We have added a new benchmark to address these concerns, presented in Fig. 4 and in new Figure 4–figure supplement 1 as a function of a controllable mutation rate.

    (1) Scoper supports multiple methods for clonal clustering, including both adaptive CDR3 distance thresholds (Nouri and Kleinstein, 2018) and shared V-gene mutations (Nouri and Kleinstein, 2020). It is not clear which method was used for benchmarking. The specific functions and settings used should have been detailed and justified. Spectral clustering with shared V gene mutations would be the most comparable to the authors' method. Similar detail is needed for partis.

    In the updated version, we use the 2020 method. The 2018 method is very similar to simple single linkage, so it is removed from the benchmark.

    (2) It is not clear how the adaptive thresholds and shared mutation analysis in the authors' method differ from prior approaches such as scoper and partis.

    We have changed the paragraph in the discussion section about the benchmark to highlight the innovative aspects and differences with previous approaches.

    (3) The scripts for performing benchmarking analyses, as well as the version numbers of programs tested, are not available.

    We have added all the scripts used for benchmarking to the GitHub repository. We have added details about the version numbers in the data and code availability section of the methods.

    (4) Similar to above, P. 10 describes single linkage hierarchical clustering with a fixed threshold as a "crude method" that "suffers from inaccuracy as it loses precision in the case of highly-mutated sequences and junctions of short length." As far as we could tell, this statement is not backed up by either citations or analyses in the paper. It should not be difficult for the authors to test this though using their simulations, as this method is also implemented in scoper.

    We have added this method to our benchmark to support that point. The results are presented in Figure 4–figure supplement 2.

    References

    Nouri N, Kleinstein SH. 2020. Somatic hypermutation analysis for improved identification of B cell clonal families from next-generation sequencing data. PLOS Comput Biol 16:e1007977. doi:10.1371/journal.pcbi.1007977

    Nouri N, Kleinstein SH. 2018. A spectral clustering-based method for identifying clones from high-throughput B cell repertoire sequencing data. Bioinformatics 34:i341-i349. doi:10.1093/bioinformatics/bty235

    We have changed citation [22] to refer to the 2018 paper. The 2020 paper is citation [18].

  2. eLife assessment

    This important study provides a new, apparently high-performance algorithm for B cell clonal family inference. The new algorithm is highly innovative and based on a rigorous probabilistic analysis of the relevant biological processes and their imprint on the resulting sequences; however, the strength of evidence regarding the algorithm's performance is incomplete, due to (1) a lack of clarity regarding how different data sets were used for different steps during algorithm development and validation, resulting in concerns of circularity, (2) a lack of detail regarding the settings for competitor programs during benchmarking, and (3) method development, data simulation for method validation, and empirical analyses all based on the B cell repertoire of a single subject. With clarity around these issues and application to a more diverse set of real samples, this paper could be fundamental to immunologists and important to any researcher or clinician utilizing B cell receptor repertoires in their field (e.g., cancer immunology).

  3. Reviewer #1 (Public Review):

    Identifying individual BCR/Ab chain sequences that are members of the same clone is a long-standing problem in the analysis of BCR/Ab repertoire sequencing data. The authors propose a new method designed to be scalable for application to huge repertoire data sets without sacrificing accuracy. Their approach utilizes Hamming Distance between CDR3 sequences followed by clustering for a fast, high-precision approach to classifying pairs of sequences as related or not, and then refines the classification using mutation information from germline-encoded regions. They compare their method with other state-of-the-art methods using synthetic data.

    The authors address an important problem in an interesting, innovative, and rigorous way, using probabilistic representations of CDR3 differences, frequencies of shared and not-shared mutations, and the relationships between the two under hypotheses of related pairs and unrelated pairs, and from these develop an approach for determining thresholds for classification and lineage assignment. Benchmarking shows that the proposed method, the complete method including both steps, outperforms other methods.

    Strengths of the method include its theoretical underpinnings which are consistent with an immunologist's intuition about how related and unrelated sequences would compare with each other in terms of the metrics to use and how those metrics are related to each other.

    I have two high-level concerns:
    (1) It isn't clear how the real and synthetic data are being used to estimate parameters for the classifier and evaluate the classifier to avoid circularity. It seems like the approach is used to assign lineages in the data from [1], and then properties of this set of lineages are used to estimate parameters that are then used to refine the approach and generate synthetic data that is used to evaluate the approach. This may not be a problem with the approach but rather with its presentation, but it isn't entirely clear what data is being used and where for what purpose. An understanding of this is necessary in order to truly evaluate the method and results.
    (2) Regarding the data used for benchmarking - given the intertwined fashion by which the classification approach and synthetic data generation approach appear to have been developed, it is not surprising that the proposed approach outperforms the other methods when evaluated on the synthetic data presented here. It would be better to include in the benchmark the data used by the other methods to benchmark themselves or also generate synthetic data using their data generation procedures.

    An improved method for BCR/Ab sequence lineage assignment would be a methodologic advancement that would enable more rigorous analyses of BCR/Ab repertoires across many fields, including infectious disease, cancer, autoimmune disease, etc., and in turn, enable advancement in our understanding of humoral immune responses. The methods would have utility to a broad community of researchers.

  4. Reviewer #2 (Public Review):

    This manuscript describes a new algorithm for clonal family inference based on V and J gene identity, sequence divergence in the CDR3 region, and shared mutations outside the CDR3. Specifically, the algorithm starts by grouping sequences that have the same V and J genes and the same CDR3 length. It then performs single-linkage clustering on these groups based on CDR3 Hamming distance, then further refines these groups based on shared mutations.
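    The refinement step relies on mutations shared outside the CDR3. As a rough illustration of the quantity involved, the sketch below counts mutations that two sequences share relative to a common germline template. It assumes both sequences are already aligned to that template without gaps; the function names are hypothetical, and how HILARy turns such counts into a clustering decision is not described in this review and is not attempted here.

```python
def mutations(seq, germline):
    """Set of (position, base) substitutions of `seq` relative to `germline`."""
    if len(seq) != len(germline):
        raise ValueError("sequence and germline must be aligned to equal length")
    return {
        (pos, base)
        for pos, (base, ref) in enumerate(zip(seq, germline))
        if base != ref
    }


def shared_mutation_counts(seq_a, seq_b, germline):
    """Return (shared, total_a, total_b) mutation counts for a pair."""
    mut_a = mutations(seq_a, germline)
    mut_b = mutations(seq_b, germline)
    # A mutation is shared only if position and substituted base both match.
    return len(mut_a & mut_b), len(mut_a), len(mut_b)
```

    Two sequences from the same clonal family are expected to share some of the mutations acquired before their lineages diverged, whereas unrelated sequences share mutations only by chance.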

    Although there are a number of algorithms that use a similar overall strategy, a couple of aspects make this work unique. First, a persistent challenge for algorithms such as this one is how to set a cutoff for single-linkage clustering: if it is too low, then one separates clusters that should be together, and if too high one joins together clusters that should be separate. Here the authors leverage a rich collection of probabilistic tools to make an optimal choice. Specifically, they model the probability distributions of within- and between-cluster CDR3 Hamming distances, with parameters depending on CDR3 length and the "prevalence" of clonal sequence pairs (i.e. family size distribution). This allows the algorithm to make optimal choices for separating clusters, given the particular chosen distance metric, and assuming the sample in question has been accurately modeled. Second, the algorithm uses a highly efficient means of doing single-linkage clustering on nucleotide sequences.

    This leads to a fast and highly performant algorithm on data meant to replicate the original sample used in algorithm design. The ideas are new and beautifully developed. The application to real data is interesting, especially the point about dN/dS.

    However, the paper leaves open the question of how this inference algorithm works on samples other than the one used for simulation and as a template for validation. If I understand the simulation procedure correctly - that one takes a collection of inferred trees from the real data, then re-draws the root sequence and the identity of the mutations on the branches - then the simulated data should be very close to the data used to develop the methods in the paper. This consideration seems especially important given that key methods in this paper use mutation counts and overall mutation counts are preserved.

    Repertoires come in all shapes and sizes: infants to adults, healthy to cancerous, and naive to memory to plasma-cell-just-after-vaccination. If this is being proposed as a general-purpose clonal inference algorithm rather than one just for this sample, then a more diverse set of validations is needed.

    It is unclear how to run the code. The software repo has a nice readme explaining the file layout, dependencies, and input file format, but the repo seems to be lacking an `inference.ipynb` mentioned there which runs an analysis. Perhaps this is a typo and refers to `inference.py`, which in addition to the documented cdr3 clustering, seems to have functions to run both clustering methods. However, it does not seem to have any documentation or help messages about how to run these functions.

    The results are not currently reproducible, because the simulated data is not available. The data availability statement says that no data have been generated for this manuscript; however, simulated data has been generated, and that is a key aspect of the analysis in the paper.

    More detail is needed to understand the timing comparisons. The new software is clearly written to use many threads. Were the other software packages run using multiple threads? What type of machine was used for the benchmarks?

  5. Reviewer #3 (Public Review):

    B cell receptors are produced through a combination of random V(D)J recombination and somatic hypermutation. Identifying clonal lineages - cells that descend from a common V(D)J rearrangement - is an important part of B cell repertoire analysis. Here, the authors developed a new method to identify clonal lineages from BCR data. This method builds off of prior advances in the field and uses both an adaptive clonal distance threshold and shared somatic hypermutation information to group B cells into clonal lineages.

    The major strength of this paper is its thorough quantitative treatment of the subject and integration of multiple improvements into the clonal clustering process. By their simulation results, the method is both highly efficient and accurate.

    The only notable weakness we identified is that much of the impact of the method will depend on its superiority to existing approaches, and this is not convincingly demonstrated by Fig. 4. In particular, little detail is given on how the other clonal clustering programs were run, and this can significantly impact their performance. More specifically:

    (1) Scoper supports multiple methods for clonal clustering, including both adaptive CDR3 distance thresholds (Nouri and Kleinstein, 2018) and shared V-gene mutations (Nouri and Kleinstein, 2020). It is not clear which method was used for benchmarking. The specific functions and settings used should have been detailed and justified. Spectral clustering with shared V gene mutations would be the most comparable to the authors' method. Similar detail is needed for partis.
    (2) It is not clear how the adaptive thresholds and shared mutation analysis in the authors' method differ from prior approaches such as scoper and partis.
    (3) The scripts for performing benchmarking analyses, as well as the version numbers of programs tested, are not available.
    (4) Similar to above, P. 10 describes single linkage hierarchical clustering with a fixed threshold as a "crude method" that "suffers from inaccuracy as it loses precision in the case of highly-mutated sequences and junctions of short length." As far as we could tell, this statement is not backed up by either citations or analyses in the paper. It should not be difficult for the authors to test this though using their simulations, as this method is also implemented in scoper.

    References
    Nouri N, Kleinstein SH. 2020. Somatic hypermutation analysis for improved identification of B cell clonal families from next-generation sequencing data. PLOS Comput Biol 16:e1007977. doi:10.1371/journal.pcbi.1007977
    Nouri N, Kleinstein SH. 2018. A spectral clustering-based method for identifying clones from high-throughput B cell repertoire sequencing data. Bioinformatics 34:i341-i349. doi:10.1093/bioinformatics/bty235