Variance Analysis of LC-MS Experimental Factors and Their Impact on Machine Learning

Abstract

Background

Machine learning (ML) technologies, especially deep learning (DL), have gained increasing attention in predictive mass spectrometry (MS) for enhancing the data processing pipeline from raw data analysis to end-user predictions and re-scoring. ML models need large-scale datasets for training and re-purposing, which can be obtained from a range of public data repositories. However, applying ML to public MS datasets on larger scales is challenging, as they vary widely in terms of data acquisition methods, biological systems, and experimental designs.

Results

We aim to facilitate ML efforts in MS data by conducting a systematic analysis of the potential sources of variance in public MS repositories. We also examine how these factors affect ML performance and perform a comprehensive transfer learning analysis to evaluate the benefits of the field's current best-practice transfer learning methods.

Conclusions

Our findings show significantly higher levels of homogeneity within a project than between projects, which indicates that it is important to construct training datasets that closely resemble future test cases, as transferability is severely limited for unseen datasets. We also found that transfer learning, although it did improve model performance, did not outperform an equivalent model trained from scratch without pre-training.

Article activity feed

  1. Competing Interest Statement: The authors have declared no competing interest.

    **Reviewer 2. Luke Carroll**

    The paper applies machine learning to publicly available proteomics datasets and assesses the ability to transfer learned models between projects. The primary aim of these models appears to be to increase the consistency of retention time prediction for data-dependent acquisition datasets; however, this is not explicitly stated in the text. The application of machine learning to derive insight from previously performed proteomics experiments is a worthwhile exercise.

    1. The authors report ΔRT to determine fitting for their models. It would be interesting to see whether other metrics were used to assess model fit, or whether the models could be used to increase the number of identifications within sample sets, and whether this was successful; see the metrics sketch after this list. Alternatively, were there any conclusions that could be drawn about peptide structure and RT determination from these models?

    2. Project-specific libraries are well known to improve results compared with publicly available databases, and the discussion on this point should be developed further through comparison of this work with other papers, particularly with advances in machine learning and neural networks in the data-independent analysis field.

    3. The comparison of Q Exactive models vs. Orbitraps appears to be somewhat redundant, and is possibly a result of poor metadata, as Q Exactive instruments are Orbitrap mass spectrometers. A more interesting comparison to make here would be between Orbitrap and TOF instruments, though as the datasets have all been processed through MaxQuant, it is likely the vast majority were acquired on Orbitrap instruments.

    4. The paper uses ΔRT as the readout for all models tested, yet the only chromatography variable considered in testing the models is gradient length. However, chromatography also depends on column chemistry, column dimensions, buffer composition, use of traps, temperature, etc. These factors are also likely to contribute to the variance observed between the PT datasets, where these variables are consistent, and the publicly available datasets. They are also likely to play a role in the higher uncertainty for early- and late-eluting peptides, where these factors are likely to vary most between sample sets. The metadata may not be available to compare these variables within the selected datasets, so the authors should at minimum discuss these points.

    5. Sample preparation is likely to have similar effects: the PT datasets are generated synthetically using ideal peptides, whereas publicly available datasets are generated from complex sample mixtures and have increased variance due to inefficiencies of digestion, sample clean-up, and matrix effects. Previous studies on variance have also described sample preparation as the largest source of variance. This needs further discussion.

    6. While the isolation windows of the m/z will lead to unobserved space, search engine settings will also apply here. From the text, it appears that the only spectra considered were those already identified by a search program (since Andromeda cut-off scores always apply). Typical settings for a database search require peptide sequences of at least 7 residues, making peptide masses below 350 m/z unlikely. There is also a significant amount of noise below 350 m/z, which also likely contributes to poorer fitting.

    7. The authors identify differences in MS/MS spectral features; however, most of these points are well known in the field. The authors should develop the discussion on the causes of the differences in fragmentation, as the CID low-mass drop-off is expected, and the change in profile is expected with increasing activation energies. A more developed analysis could exclude precursor masses from these plots and focus solely on the fragment ions generated.

    8. The authors highlight that internal fragmentation of peptides could be a valuable resource to implement in machine learning. There has already been some success using these fragmentation patterns for sequence identification within both top-down and bottom-up proteomic searches, which the authors should consider discussing. However, these data do not appear to be incorporated into the machine learning models in this paper, or at least seem not to play a significant role in prediction, and this section appears to be a bit out of place.
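    The request in point 1 for additional fit metrics could be illustrated with a minimal sketch like the one below. It is not taken from the manuscript: the input file and column names ("rt_observed", "rt_predicted") are hypothetical, and the code simply computes ΔRT alongside standard regression metrics.

    ```python
    # Hedged sketch: ΔRT plus complementary regression metrics for RT prediction.
    # The CSV file and its column names are hypothetical placeholders.
    import numpy as np
    import pandas as pd
    from scipy import stats

    df = pd.read_csv("rt_predictions.csv")            # observed vs. predicted RT per peptide
    delta_rt = df["rt_predicted"] - df["rt_observed"]

    pearson_r, _ = stats.pearsonr(df["rt_observed"], df["rt_predicted"])
    spearman_rho, _ = stats.spearmanr(df["rt_observed"], df["rt_predicted"])

    metrics = {
        "median |ΔRT|": np.median(np.abs(delta_rt)),            # the ΔRT-style readout
        "95th pct |ΔRT|": np.percentile(np.abs(delta_rt), 95),  # tail behaviour
        "MAE": np.mean(np.abs(delta_rt)),
        "RMSE": float(np.sqrt(np.mean(delta_rt ** 2))),
        "Pearson r": pearson_r,
        "Spearman rho": spearman_rho,
    }
    for name, value in metrics.items():
        print(f"{name}: {value:.3f}")
    ```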

    Re-Review: The changes and additions to the discussion address the key points and acknowledge some of the limitations imposed by the availability and extractability of certain data elements, particularly around sample preparation and LC settings. I think this strengthens the manuscript and provides a more holistic discussion of the factors in the experimental setup.

  2. This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad096), and the reviews have been published under the same license. These are as follows.

    **Reviewer 1: Juntao Li**

    This paper aimed to facilitate machine learning efforts in mass spectrometry data by conducting a systematic analysis of the potential sources of variance in public mass spectrometry repositories. This paper examined how these factors affect machine learning performance and performed a comprehensive transfer learning analysis to evaluate the benefits of current best-practice methods in the field for transfer learning. Although the experimental content is extensive and provides promising results, some major points need to be addressed as follows:

    1. Please explain the rationale for using ΔRT to evaluate model performance. In addition, it is necessary to include additional evaluation metrics to provide a more powerful comparison of model performance.

    2. The curves in Figures 6 and 8 should be accompanied by more explanation to help readers understand them. In addition, all figures are somewhat blurry, and clearer figures should be provided.

    3. This paper does not provide the specific implementation steps of the variance analysis. Please describe the variance analysis process in mathematical language and provide the corresponding formulae; see the sketch of one possible decomposition after this list.

    4. There are some formatting issues: the keywords and the title 'Data Description' should only have the first letter capitalized. On pages 6, 17, and 18, the font size is inconsistent.

    5. There are some grammar issues: on pages 6 and 16, 'dataset' should be pluralized to 'datasets'. On page 7, lines 9-10, the tense is inconsistent.

    6. There are significant issues with the formatting of the references: inconsistent capitalization of initial letters in article titles, such as in [1] and [5], and some references lack page numbers, such as [6] and [18]. Please re-organize the references according to the format required by the journal.
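    One possible way to formalize the variance analysis requested in point 3 is a one-way decomposition of a per-peptide error (here ΔRT) into between-project and within-project components, as in a one-way ANOVA. The sketch below only illustrates the kind of formula the manuscript could state explicitly; the input file and its columns ("project", "delta_rt") are hypothetical.

    ```python
    # Hedged sketch: between-/within-project decomposition of the total sum of squares,
    # SS_total = SS_between + SS_within. File and column names are hypothetical.
    import numpy as np
    import pandas as pd

    df = pd.read_csv("per_peptide_errors.csv")       # columns: "project", "delta_rt"
    grand_mean = df["delta_rt"].mean()
    groups = df.groupby("project")["delta_rt"]

    # Between-project: n_g * (project mean - grand mean)^2, summed over projects.
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for _, g in groups)
    # Within-project: squared deviations from each project's own mean.
    ss_within = sum(((g - g.mean()) ** 2).sum() for _, g in groups)
    ss_total = ((df["delta_rt"] - grand_mean) ** 2).sum()

    assert np.isclose(ss_between + ss_within, ss_total)
    print(f"between-project share: {ss_between / ss_total:.2%}")
    print(f"within-project share:  {ss_within / ss_total:.2%}")
    ```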

    Re-Review:

    I am glad to see that the authors have revised the manuscript based on the reviewer's comments and improved its quality. However, the responses to some comments did not fully convince me. I suggest the authors further revise or explain the following issues.

    1. I agree with the rationale of ΔRT as a performance measure, but I do not agree with the authors' viewpoint that 'However, as the model performance indicates metric variance, and there are no changes to the conclusions drawn from the model performance'. I suggest the authors truthfully provide other classic machine learning performance metrics on the test dataset and analyze the differences.

    2. To avoid the randomness caused by a single data partitioning (into training and testing sets), a repeated random partitioning strategy (e.g., 50 or 100 repetitions) is usually adopted to evaluate learner performance via averaged performance measures and their variance; a sketch of such a scheme follows this review. It is recommended that the authors consider this issue.

    3. The structure and references of papers that I have seen officially published in GigaScience are very different from those of the manuscript (the authors have claimed to have organized and written it according to the requirements). I am not sure whether this is my mistake or the authors'. I suggest the authors confirm this issue again and improve the writing.
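    A minimal sketch of the repeated random-partitioning scheme suggested in point 2 is given below. The feature matrix, target values, and regressor are placeholders (scikit-learn is assumed); the point is only to report the mean and standard deviation of a metric over many random train/test splits rather than a single one.

    ```python
    # Hedged sketch: repeated random train/test splits with averaged performance.
    # X, y, and the Ridge regressor are placeholders for the real features and model.
    import numpy as np
    from sklearn.linear_model import Ridge
    from sklearn.metrics import median_absolute_error
    from sklearn.model_selection import ShuffleSplit

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 20))                                  # placeholder features
    y = X @ rng.normal(size=20) + rng.normal(scale=0.1, size=500)   # placeholder targets

    splitter = ShuffleSplit(n_splits=50, test_size=0.2, random_state=0)
    scores = []
    for train_idx, test_idx in splitter.split(X):
        model = Ridge().fit(X[train_idx], y[train_idx])
        scores.append(median_absolute_error(y[test_idx], model.predict(X[test_idx])))

    print(f"median absolute error over 50 splits: {np.mean(scores):.3f} ± {np.std(scores):.3f}")
    ```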