Democratizing data-independent acquisition proteomics analysis on public cloud infrastructures via the Galaxy framework

Matthias Fahrner
Melanie Christine Föll
Björn Andreas Grüning
Matthias Bernt
Hannes Röst
Oliver Schilling

Abstract

Background

Data-independent acquisition (DIA) has become an important approach in global, mass spectrometric proteomic studies because it provides in-depth insights into the molecular variety of biological systems. However, DIA data analysis remains challenging owing to the high complexity of the data and the large data and sample sizes, which require specialized software and vast computing infrastructures. Most available open-source DIA software necessitates basic programming skills and covers only a fraction of a complete DIA data analysis. Consequently, DIA data analysis often requires the use of multiple, mutually compatible software tools, which severely limits usability and reproducibility.

Findings

To overcome this hurdle, we have integrated a suite of open-source DIA tools in the Galaxy framework for reproducible and version-controlled data processing. The DIA suite includes OpenSwath, PyProphet, diapysef, and swath2stats. We have compiled functional Galaxy pipelines for DIA processing, which provide a web-based graphical user interface to these pre-installed and pre-configured tools for their use on freely accessible, powerful computational resources of the Galaxy framework. This approach also enables seamless sharing of workflows with their full configuration, in addition to sharing raw data and results. We demonstrate the usability of an all-in-one DIA pipeline in Galaxy by the analysis of a spike-in case study dataset. Additionally, extensive training material is provided to further increase access for the proteomics community.

Conclusion

The integration of an open-source DIA analysis suite in the web-based and user-friendly Galaxy framework, in combination with extensive training material, empowers a broad community of researchers to perform reproducible and transparent DIA data analysis.

GigaScience
Mar 14, 2022
This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac005), which carries out open, named peer-review.

These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer 2: Peter Horvatovich

The article GIGA-D-21-00223, entitled "Democratizing Data-Independent Acquisition Proteomics Analysis on Public Cloud Infrastructures Via The Galaxy Framework", describes a targeted DIA LC-MS/MS processing workflow implemented in the Galaxy framework. The paper describes the tools integrated in the Galaxy environment and the workflow steps to process DIA LC-MS/MS data using a targeted spectral library approach. The authors used a HEK cell lysate spiked with E. coli digest at various ratios and used these samples to generate DIA LC-MS/MS data on an Orbitrap QE+ with MS1 scans and 24 DIA windows with 50% overlap between 400-1000 m/z, in 4 replicates for each condition. The implemented workflow contains library generation from DDA data with MaxQuant processing, library cleaning, analysis of the DIA data with OpenSWATH, and statistical analysis using the MSstats package in R. The authors present identification and quantification of proteins in the example data (differential analysis, volcano plot, CV plot).

The article is of potential interest to the proteomics community, as it promotes the use of complex DIA data processing workflows through the Galaxy web interface; establishing such workflows would otherwise require considerable programming skills and time from the user. However, the authors should address some major and minor issues before I can recommend the article for acceptance.

Major concerns:

The tools and the DIA processing workflows are implemented in Galaxy Europe, which provides an amount of resources unknown to me in terms of disk space and computational capacity (CPU, RAM). The authors should describe the limitations of using this online Galaxy server (maximum upload size, CPU time, any cost of using the service, RAM limits for the tools, etc.).

Some users do not want to use cloud-based services or a public Galaxy server, but wish to process their data (e.g. clinical samples from humans) on their own closed, local computational infrastructure. For these users the authors should provide a tutorial on how to install Galaxy (simply referring to the Galaxy installation documentation), how to obtain the tools from the Galaxy ToolShed, and how to run the pipeline. Some users may already have a Galaxy server, where installing additional tools may interfere with the existing setup. I would therefore strongly suggest creating a Docker image in which a single instance of Galaxy is installed with all necessary tools, and which includes the raw data and settings, in order to provide a clean workflow that is sure to work.

I would also like to see data on the actual runtime for the example dataset, especially for the FDR calculation, as the authors mention that subsampling of the data is required for this step.

I would also present peptide-level results, as protein quantities are obtained only after protein inference from multiple peptides, whereas the instrument measures peptides.

The CV distribution of proteins in Figure 4a should be compared to results from other datasets, as it is multimodal and broad, and appears to be independent of the spiking levels. This indicates some artifacts in the data.
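
As an illustration of the quantity under discussion, the following is a minimal sketch of a per-protein coefficient of variation (CV) across replicates. The protein names and intensity values are hypothetical, and the choice of the sample standard deviation is an assumption; the manuscript's own CV calculation may differ.

```python
from statistics import mean, stdev


def coefficient_of_variation(intensities):
    """Return the CV in percent for one protein across replicates.

    Missing measurements (None) are ignored. Returns None when the CV
    is undefined (fewer than two values, or a mean of zero).
    """
    values = [v for v in intensities if v is not None]
    if len(values) < 2 or mean(values) == 0:
        return None
    return 100.0 * stdev(values) / mean(values)


# Hypothetical intensities: protein -> four replicate measurements
replicates = {
    "protein_A": [100.0, 110.0, 90.0, 100.0],
    "protein_B": [50.0, 150.0, 100.0, None],
}

cvs = {name: coefficient_of_variation(vals) for name, vals in replicates.items()}
```

Plotting these values per spike-in condition (for example as a histogram) would show whether the multimodal shape persists within each condition.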

The data are only subjected to retention time alignment using iRT peptides, but no normalization is applied. The authors should check the individual distributions of peptides and proteins in each replicate with box plots or violin plots and, if necessary, apply normalization to avoid "upregulated" human proteins. It would also be useful to color the dots in the volcano plot according to species (human/E. coli). The authors refer to displacement effects, but the text does not explain what this means (maybe ion suppression?).
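
One common way to apply such a normalization is median centering of log2 intensities per run. The sketch below is only an illustration with hypothetical run names and values; it assumes that most measured features are unchanged between runs, and it is not the method used in the manuscript.

```python
from math import log2
from statistics import median


def median_normalize(runs):
    """Shift each run's log2 intensities so that all runs share one median.

    `runs` maps a run name to a list of positive raw intensities.
    The common target is the median of the per-run medians.
    """
    log_runs = {run: [log2(v) for v in vals] for run, vals in runs.items()}
    run_medians = {run: median(vals) for run, vals in log_runs.items()}
    target = median(run_medians.values())
    return {
        run: [v - run_medians[run] + target for v in vals]
        for run, vals in log_runs.items()
    }


# Hypothetical raw intensities for three replicates
runs = {
    "rep1": [1024.0, 2048.0, 4096.0],
    "rep2": [2048.0, 4096.0, 8192.0],
    "rep3": [512.0, 1024.0, 2048.0],
}
normalized = median_normalize(runs)
```

In a spike-in design, the medians could instead be computed on the constant background only (here, the human peptides), so that the spiked species does not influence the shift.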

Please provide the distribution of missing values for each replicate, as DIA should provide data with a low percentage of missing (0) values.
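
The requested summary could be computed along the following lines. This is a minimal sketch with hypothetical replicate names and values, assuming that a missing measurement is encoded either as None or as 0.

```python
def missing_fraction(values):
    """Return the fraction of entries that are missing (None) or zero."""
    if not values:
        return 0.0
    missing = sum(1 for v in values if v is None or v == 0)
    return missing / len(values)


# Hypothetical quantification table: replicate -> peptide intensities
quant = {
    "rep1": [1200.0, 0, 950.0, 400.0],
    "rep2": [1100.0, 300.0, None, 0],
}

missing_by_replicate = {rep: missing_fraction(vals) for rep, vals in quant.items()}
```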

Minor concerns:

All figures and plots look like low-resolution bitmaps. Please provide high-resolution plots, preferably as vector graphics.

Figure 2B: please restrict R2 values to 4 decimals.

Page 15, please explain what the contrast matrix is.

Page 15, I would replace "time consumption" with "required execution time".

The authors mention in several places (e.g. page 19 and the legend of Table 2) that they have "developed tools" for DIA analysis. This is not accurate: they did not develop the original tools but integrated them into the Galaxy environment in this study. Please correct this.

In Figure 3 and Supplementary Figures 1-4, "Blot" is written, which I guess should be "Plot".

Page 21, Unix is mentioned as the operating system, which I guess is not correct; rather, Linux is used. Please provide the distribution and version number.

Reviewer 1: Paul Stewart

Fahrner et al. have produced a very nice manuscript and corresponding pipeline. They describe a collection of DIA tools in the Galaxy framework for reproducible and version-controlled data processing. These DIA tools are an excellent addition to the growing number of proteomics-centric tools already available in Galaxy. The reviewer could find no need for major revisions and therefore requests only a few minor revisions before the manuscript is ready for publication:

Please include page numbers in the revised manuscript to make referencing the text easier.

Page 6

OpenSwath and PyProphet are cited and are also used in the manuscript. Please cite one or two alternatives.

Please consider citing a tool each time it is used in a new paragraph (e.g. MSstats).

There is heavy reliance on conjunctive adverbs (However, ...; Thus, ...) on this page and throughout the manuscript. These can make passages a bit hard to read. Please consider rephrasing.

Page 7

Why "so-called histories"? Aren't they simply "Histories"?

Page 14

'To decrease the analysis time of the semi-supervised learning, the merged OSW results can be first subsampled using the PyProphet subsample tool and subsequently scored using the PyProphet score tool. '

The reviewer is not familiar with this approach. Can you please give additional justification (maybe under methods?) or provide a citation that this is a reasonable approach?

Page 15

Please check your reference software and/or work with the journal to ensure that the web addresses are linked properly. For example, the reviewer tried copying the link "https://training.galaxyproject.org/training- %20material/topics/proteomics/tutorials/DIA_lib_OSW/tutorial.html" but a "%20" (or a space) is inserted into the URL after "training-", so the link as it appears did not work until this was removed. A less technically savvy reader may think the links are broken and will not be able to access the materials.

Page 16

'We identified and quantified between 25.000 to 27.000 peptides ...'

Please be consistent with number formatting (25000 vs 25.000). Other values in the tables did not use this formatting. Please check with journal editor for convention.

Figures

Please be consistent with axes labels. Some are upper case and some are lower case.

Figure 2B

Please round R2 to 2 or 3 decimals.

Figure 3

Please change the red-green color scheme to a more color-blind-friendly color scheme (e.g. red-blue).

Version published to 10.1093/gigascience/giac005
Jan 1, 2022
Version published to 10.1101/2021.07.21.453197 on bioRxiv
Jul 22, 2021
