Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification


Abstract

Background

Machine learning (ML) methodology development for the classification of immune states in adaptive immune receptor repertoires (AIRRs) has seen a recent surge of interest. However, there does not yet exist a systematic evaluation of the scenarios in which classical ML methods (such as penalized logistic regression) already perform adequately for AIRR classification. This hinders redirecting method-development efforts toward those scenarios where more sophisticated ML approaches may be required.

Results

To identify those scenarios where a baseline ML method is able to perform well for AIRR classification, we generated a collection of synthetic AIRR benchmark data sets encompassing a wide range of data set architecture–associated and immune state–associated sequence pattern (signal) complexity. We trained ≈1,700 ML models with varying assumptions regarding immune signal on ≈1,000 data sets with a total of ≈250,000 AIRRs containing ≈46 billion TCRβ CDR3 amino acid sequences, thereby surpassing the sample sizes of current state-of-the-art AIRR-ML setups by two orders of magnitude. We found that L1-penalized logistic regression achieved high prediction accuracy even when the immune signal occurs only in 1 out of 50,000 AIR sequences.

Conclusions

We provide a reference benchmark to guide new AIRR-ML classification methodology by (i) identifying those scenarios characterized by immune signal and data set complexity, where baseline methods already achieve high prediction accuracy, and (ii) facilitating realistic expectations of the performance of AIRR-ML models given training data set properties and assumptions. Our study serves as a template for defining specialized AIRR benchmark data sets for comprehensive benchmarking of AIRR-ML methods.

Article activity feed

  1. ML

**Reviewer name: Gael Varoquaux (revision 1)**

    I would like to thank the authors for the work done on their manuscript, in particular for adding the experiments that enable linking to sparse-recovery theory. In my opinion, the manuscript brings a lot of value to the application community and is pretty much complete. A few details come to mind that could help its message be most accurate. Because of my suggestions, the authors have used an l1 penalty in the SVC. This worked well in terms of prediction; however, it is not the default. I think that the authors should stress this and be precise about the penalty each time they mention the SVC. In addition, I think that there would be value in performing an additional experiment with an l2 penalty (which is the default) to stress the importance of the l1 penalty. The message should stress that the penalty (l1 vs l2) is important, more so than the loss (logistic regression vs SVC); a minimal sketch of this contrast follows below. As a minor detail, I would invert the color scale of one of the plots in Figures S12 and S13, to stress the parallel between the two. Finally, I think that it is important to stress in the conclusion that all the results build on the fact that the predictive information is sparse (maybe putting this in words more familiar to the application community).

    Declaration of Competing Interests: I declare that I have no competing interests.
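    The penalty contrast raised above can be illustrated on synthetic data with a sparse ground truth. The sketch below is an editorial illustration using scikit-learn, not code from the benchmarked study; the data dimensions and hyperparameters are arbitrary placeholders.

    ```python
    # Minimal sketch: with a sparse ground truth, the choice of penalty (l1 vs l2)
    # matters more than the choice of loss (logistic regression vs linear SVC).
    # Dimensions and hyperparameters are illustrative, not those of the study.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.svm import LinearSVC

    # 200 repertoire-like samples, 5,000 features, only 10 carrying signal
    X, y = make_classification(n_samples=200, n_features=5000, n_informative=10,
                               n_redundant=0, random_state=0)

    models = {
        "logreg-l1": LogisticRegression(penalty="l1", solver="liblinear", C=1.0),
        "logreg-l2": LogisticRegression(penalty="l2", solver="liblinear", C=1.0),
        "svc-l1": LinearSVC(penalty="l1", loss="squared_hinge", dual=False, C=1.0),
        "svc-l2": LinearSVC(penalty="l2", C=1.0),
    }

    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
        print(f"{name:10s} mean ROC AUC = {scores.mean():.3f}")
    ```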

  2. Results

**Reviewer name: Filippo Castiglione**

    The article "Profiling the baseline performance and limits of machine learning models for adaptive immune receptor repertoire classification by Kanduri1 et al. describes the construction of suitable reference benchmarks data-sets to guide new AIRR ML classification methods. The article is interesting and potentially useful in defining benchmark data sets and criteria for constructing specialized AIRR benchmark datasets for the community of researcher interested in AIRR. The authors following previous indications about model reproducibility and availability also provide a docker container which include all data and procedures to reproduce the study. The article is sufficiently well written although at time a bit full of details which perhaps could be synthesised further (this has already been done in pictures and tables). I don't have major concerns. Only a couple of notes. Would be good to have a figure or diagram showing an example of bags containing receptors and associated witnesses. It could illuminate the reader not familiar with Multiple instanvd learning. Would be good to have line commands for the generation of data sets (in the case, for instance, of use of Olga). I understand these are inside the docker container but the reader that is not interested in the whole container might find useful to have access to pieces of the pipeline so to use this or that tool (being it in immuneML, in Olga, etc.). Curiosity: why have the authors used Olga and not the mate Igor? Why is the performance metric in model training the accuracy and not, for instance, the F1-score? Any particular reason? Methods Are the methods appropriate to the aims of the study, are they well described, and are necessary controls included? Choose an item. Conclusions Are the conclusions adequately supported by the data shown? Choose an item. Reporting Standards Does the manuscript adhere to the journal’s guidelines on minimum standards of reporting? Choose an item. Choose an item. Statistics Are you able to assess all statistics in the manuscript, including the appropriateness of statistical tests used? Choose an item. Quality of Written English Please indicate the quality of language in the manuscript: Choose an item. Declaration of Competing Interests Please complete a declaration of competing interests, considering the following questions:  Have you in the past five years received reimbursements, fees, funding, or salary from an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold any stocks or shares in an organisation that may in any way gain or lose financially from the publication of this manuscript, either now or in the future?  Do you hold or are you currently applying for any patents relating to the content of the manuscript?  Have you received reimbursements, fees, funding, or salary from an organization that holds or has applied for patents relating to the content of the manuscript?  Do you have any other financial competing interests?  Do you have any non-financial competing interests in relation to this paper? If you can answer no to all of the above, write 'I declare that I have no competing interests' below. If your reply is yes to any, please give details below. I declare that I have no competing interests I agree to the open peer review policy of the journal. 
I understand that my name will be included on my report to the authors and, if the manuscript is accepted for publication, my named report including any attachments I upload will be posted on the website along with the authors' responses. I agree for my report to be made available under an Open Access Creative Commons CC-BY license (http://creativecommons.org/licenses/by/4.0/). I understand that any comments which I do not wish to be included in my named report can be included as confidential comments to the editors, which will not be published.
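    For readers who want the sequence-generation step without pulling the full container, OLGA can be invoked from the command line roughly as follows. This is an editorial sketch rather than the exact command used in the study: the output file name and sequence count are placeholders, and the available options should be checked against the installed OLGA version (`olga-generate_sequences --help`).

    ```bash
    # Generate synthetic human TCRbeta CDR3 sequences with OLGA's default model
    # (output path and count are placeholders)
    olga-generate_sequences --humanTRB -n 10000 -o synthetic_trb_cdr3.tsv
    ```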

  3. Background

**Reviewer name: Enkelejda Miho**

    General opinion: approved with minor changes.

    Comments: The manuscript profiles machine learning methods for immune state label prediction on AIRR T-cell receptor data sets in order to establish the baseline performance of such methods across a diverse set of challenges. Simulated data sets with variable properties are used to provide a large number of benchmarking data sets with known immune state signals while reflecting the natural complexity of experimental data sets. The results provide insights into the current limits posed by basic data set properties on baseline ML models and establish a frontier for improvement in AIRR-ML research. The manuscript is understandable and well structured in its approach to comparisons as well as in its conclusions. The graphics are clear and consistent and support the manuscript. Very interesting insight into the importance of individual parameters such as sample size or witness rate for the overall accuracy. The advantage of the results to the scientific community is that they offer an evaluation of classical ML methods, provide large and specialized AIRR benchmark data sets, and allow further development and benchmarking of more sophisticated ML methods. The manuscript is overall well written and we endorse it with the following minor changes:

    1. In the paragraph "Impact of noise on classification performance" (page 14), the sentence "but enriched above a baseline in positive class examples" should be corrected to "but being enriched above a baseline in positive class examples".
    2. In the paragraph "Machine learning models" (Methods section, page 21), "lasso" should be corrected to "Lasso".
    3. In the same paragraph, " '- ' " should be corrected to "'-'", and "𝑋jdenoting" should be corrected to "𝑋j denoting".
    4. In the Discussion, the sentence "which aligns with the observations that that the majority of the possible contacts between TCR and peptide" should be corrected to "which aligns with the observations that the majority of the possible contacts between TCR and peptide".
    5. Keep comparisons like "size>500" and "size > 500" consistent.
    6. Check for missing whitespace, as in the description of Figure 1(b) ("…(5 x 105 % of sequence…"), and likewise in cases like "≈90%" vs "≈ 90 %" or "n=60" vs "n = 60".

    Declaration of Competing Interests: Enkelejda Miho owns shares in aiNET GmbH.

  4. Abstract

    This work has been peer reviewed in GigaScience (see paper), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    **Reviewer name: Gael Varoquaux** The manuscript by Kanduri et al. benchmarks baseline machine-learning methods on simulated sequencing data of adaptive immune receptors, to predict immune states of individuals by detecting antigen-specific signatures. Given that there is a volume of publications using a wide variety of different machine learning techniques with the promise of clinical diagnostics on such data, the goal of the study is to set baseline expectations. From an application standpoint, I believe that the study is well motivated and useful to the community. From a signal processing standpoint, many aspects of the study are trivial consequences of the simulation choices: sparse estimators are good for prediction when the signal is generated from sparse coefficients. Though I do not know this application community well, it seems to me that the manuscript is valuable because it casts this knowledge in a specific application setting; however, it should discuss a bit more the fundamental statistical reasons that underlie the empirical findings. I give below some major and minor comments to help make the study more solid.

    1. Plausibility of the simulations. The validity of the findings relies crucially on the simulations, in particular the hypotheses of extreme sparsity. These hypotheses need to be discussed in more detail, with references to back them. The amount of sparsity, as detailed in Table 1, is huge, which strongly favors sparse models.
    2. Another baseline, natural given the sparsity. I do realize that the goal of this study is not to do an exhaustive comparison of all machine learning methods -- an impossible task -- however, for someone knowledgeable about sparse signal processing, the study begs the question of whether univariate tests on appropriate k-mers can be enough, an alley suggested by the authors on page 7. This option should be studied empirically, as it would provide an important practical method (a minimal sketch of such a baseline appears after this list).
    3. Link to sparse model theory. A vast variety of theoretical results state that a sparse model will be successful for n proportional to s log(p), where n here would be the number of samples in the minority class and s would be the number of non-zero coefficients. A good summary of these results can be found in the book "Statistical Learning with Sparsity: The Lasso and Generalizations", T. Hastie, R. Tibshirani, M. Wainwright, 2019. It would be interesting to see how these theoretical scalings match the results, for instance those in Figure 3 (the scaling is restated after this list).
    4. Accuracy and class imbalance. It seems to me that in parts of the manuscript (Fig. 4a for instance) accuracy is compared across different scenarios with varying class imbalance. However, accuracy is not comparable when class imbalance varies: for instance, with 90% positive class, a classifier that always chooses the positive label will have 0.9 accuracy. In this light, I don't understand Fig. 4a, in which even for large class imbalance accuracy goes to 0.5. In addition, the typical good practice is to use a metric for which decisions under chance are not affected by class imbalance, such as the area under the ROC curve (a small illustration follows after this list).
    5. Comparison with SVC. The manuscript mentions that a Support Vector Classifier is also benchmarked; however, it does not give details on which specific SVC is used. A crucial point is the kernel: with a linear kernel, the SVC is a linear model, while with another kernel (an RBF kernel, for instance), the SVC is a much more complex model and is not expected to behave well in large-p, small-n problems. Also, I suspect that the SVC is used with the l2 regularization. A linear SVC with l1 regularization would likely have similar performance as the l1-penalized logistic regression, as it is a model of the same nature. These details should be added; ideally, if the model benchmarked is not a linear SVC, a linear SVC should be benchmarked to give a baseline (though the default l2 regularization can be used, to stick to common practices).
    6. Wording in the conclusion. The conclusion starts with "To help the scientific community in avoiding futile efforts of developing...". The word "futile" is too strong and the phrasing will not encourage healthy scientific discussion. I try to sign my reviews as much as possible. Gaël Varoquaux

    Declaration of Competing Interests: I have no competing interests.
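    On point 2 above, a univariate k-mer baseline could look like the sketch below: each repertoire is summarized by k-mer occurrence counts and a per-k-mer test ranks candidate witnesses. This is an editorial illustration with toy data and arbitrary choices (3-mers, a chi-squared test), not a method from the study.

    ```python
    # Sketch of a univariate k-mer baseline (illustrative only): summarize each
    # repertoire by 3-mer occurrence counts, then rank k-mers with a chi-squared
    # test against the repertoire label.
    from itertools import product

    import numpy as np
    from sklearn.feature_selection import chi2

    AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
    KMER_INDEX = {"".join(k): i for i, k in enumerate(product(AMINO_ACIDS, repeat=3))}

    def kmer_counts(repertoire, k=3):
        """Count k-mer occurrences over all CDR3 sequences of one repertoire."""
        counts = np.zeros(len(KMER_INDEX))
        for seq in repertoire:
            for i in range(len(seq) - k + 1):
                counts[KMER_INDEX[seq[i:i + k]]] += 1
        return counts

    # Toy example: four tiny "repertoires" with binary labels (real repertoires
    # contain on the order of 1e5 sequences each).
    repertoires = [
        ["CASSLGQETQYF", "CASSPTGELFF"],
        ["CASSLGQGNTEAFF", "CASSLGQYEQYF"],
        ["CASSQDRGTEAFF", "CASSLAGYNEQFF"],
        ["CASRPDRNTGELFF", "CASSPGQGAYEQYF"],
    ]
    labels = np.array([1, 1, 0, 0])

    X = np.vstack([kmer_counts(r) for r in repertoires])
    observed = X.sum(axis=0) > 0                  # drop k-mers never observed
    scores, pvalues = chi2(X[:, observed], labels)
    kmers = np.array(list(KMER_INDEX))[observed]
    top = np.argsort(pvalues)[:5]
    print("top-ranked 3-mers:", kmers[top].tolist())
    ```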
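    On point 3 above, the sample-complexity results referred to are typically stated in the following schematic form, where n is the number of examples (here, plausibly of the minority class), s the number of non-zero coefficients, p the number of features, and C a constant depending on the design and noise level. This is a paraphrase of standard l1-recovery results, not a bound derived for the specific simulation settings of the study.

    ```latex
    % Schematic sample-complexity scaling for l1-penalized (sparse) estimators:
    % reliable recovery/prediction when the number of examples n satisfies
    n \;\gtrsim\; C \, s \log p
    ```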
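    On point 4 above, the incomparability of accuracy across imbalance levels can be seen with a trivial majority-class predictor: its accuracy equals the majority-class fraction, while its ROC AUC stays at chance. The snippet below is a generic editorial illustration, not a re-analysis of the study's results.

    ```python
    # Why accuracy is not comparable across class-imbalance levels: a classifier
    # that always predicts the positive class reaches an accuracy equal to the
    # positive-class fraction, while its ROC AUC stays at chance (0.5).
    import numpy as np
    from sklearn.metrics import accuracy_score, roc_auc_score

    rng = np.random.default_rng(0)
    for positive_fraction in (0.5, 0.7, 0.9):
        y_true = (rng.random(10_000) < positive_fraction).astype(int)
        always_positive = np.ones_like(y_true)   # hard predictions
        constant_scores = np.ones(len(y_true))   # degenerate decision scores
        acc = accuracy_score(y_true, always_positive)
        auc = roc_auc_score(y_true, constant_scores)
        print(f"positive fraction {positive_fraction:.1f}: "
              f"accuracy = {acc:.2f}, ROC AUC = {auc:.2f}")
    ```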