RSYD-BASIC: a bioinformatics pipeline for Routine Sequence analYsis and Data processing of BActerial iSolates for clInical miCrobiology

Abstract

Background: Whole genome sequencing of bacterial isolates is increasingly becoming routine in clinical microbiology; however, subsequent analysis often needs to be started by a bioinformatician even for comprehensive pipelines. To increase the robustness of our workflow and free up bioinformatician work hours for development and advanced analysis, we aimed to produce a robust, customizable bioinformatic pipeline for bacterial genome assembly and routine analysis results that could be initiated by non-bioinformaticians.

Results: We introduce the RSYD-BASIC pipeline for bacterial isolate sequence analysis and provide a demonstration of its functionality with two datasets composed of publicly available sequences, in which comparable results are obtained in most cases. In some instances, the pipeline provided additional information, corresponding to in vitro results where these could be obtained. In routine use at our department, the pipeline has already yielded clinically relevant results, allowing us to type a variety of bacterial pathogens isolated in our clinical laboratory. We also demonstrate how RSYD-BASIC results aided in disproving a potential outbreak.

Conclusion: With the RSYD-BASIC pipeline, we present a configurable reads-to-results analysis pipeline operated by non-expert users that greatly eases investigation of potential outbreaks by expert end-users. Results obtained with publicly available sequences show comparable performance to the original methods, while underlining the importance of standardized methods.

Article activity feed

  1. Dear Kat Steinke, Thank you for your response and substantial changes. Before I return this to reviewers, I would request that the data for the minimum spanning tree be included. I'm not aware of any ethical concerns with including this, as it will not have any patient data associated. If this cannot be included, I suggest that the data and section be removed. We require all data discussed or used in the manuscript to be included and available for reviewers. Best wishes, John.

  2. Dear Kat Steinke, Thank you for your patience. I've now secured several reviews for the manuscript, and the reviewers agree on major revisions. Please address the reviewers' comments, and let me know if you have any questions. I'd also like to apologise for the delay in returning this to you. Securing reviewers in this instance took longer than we would have liked. Best wishes, John.

  3. Comments to Author

    This study presents a pipeline designed to process raw short-read data and analyse it using various bioinformatics tools, specifically tailored for clinical microbiologists with limited experience in using command-line bioinformatic tools.

    Major comments:

    L90-98: As the editor previously pointed out, the manuscript is in English, and consequently the tool's input should also be in English to ensure accessibility for readers. Using Danish input limits the tool's utility and hinders broader usage, particularly when published in an English-language journal.

    Minor comments:

    L53: The term "mixed samples" in this context is ambiguous and could be interpreted in two different ways: (1) samples containing multiple bacterial species, or (2) samples that include DNA from different sources, such as both bacterial and human DNA. Metagenomic sequencing typically refers to the second scenario, which requires a distinct bioinformatic pipeline for data analysis. Further clarification will help avoid ambiguity about the study's scope and methodology.

    L71-72: Using bioinformatics tools requires familiarity with computational techniques and the command-line environment, which does not necessarily require a bioinformatician's expertise in computer science. I think the key point here is the importance of familiarity with computational biology for running and interpreting bioinformatics analyses, rather than a requirement for expertise in bioinformatics to use these tools.

    Please rate the manuscript for methodological rigour

    Satisfactory

    Please rate the quality of the presentation and structure of the manuscript

    Satisfactory

    To what extent are the conclusions supported by the data?

    Strongly support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes

  4. Comments to Author

    The short report by Steinke and colleagues describes a bioinformatics pipeline written in Snakemake which automates the analysis of WGS data using standard (though unvalidated) research tools in a clinical setting. This is interesting because (at least in the UK) there are still very few centers capable of performing such analyses, and they tend to be confined to a small number of academic hospitals. I would suggest rewriting the article to focus on the implications of WGS in your clinical setting, with particular focus on the referral process for WGS, quality control and interpretation of results. For example, you could provide more detail on the suspected CLABSI outbreak; showing how the output from your pipeline changed management would be nice. Perhaps some timelines and trees? I find the manuscript a bit tricky to read because much of the results section is actually methods (written as single-sentence paragraphs). I appreciate that some description of your tool is indeed a result here, but as above I would suggest focussing the results more on the application of the tool (the tool itself is, I think, more of a secondary result here; it is its application that is of more interest, in my opinion). I would strongly suggest that the authors substantially tone down any implication that this is a tool ready for general clinical deployment. It is essentially just another locally developed solution (albeit it looks like a reasonably good one) that can be combined with the significant expertise available in an academic center to aid in local infection control procedures. It certainly does not deliver an off-the-shelf solution for infection doctors who are not WGS experts. I think the output you generate still requires a substantial degree of expertise to interpret. Looking at the example in Fig. 1, I'm not clear that this is really a major advance over the use of command line tools? 
Our lab scientists can be taught how to run repetitive operations on the command line quickly. I might be mistaken, but you haven't actually written a GUI here, and so the system presented isn't really that different from e.g. Bactopia, except for the fact that it integrates patient data (though this is likely to require a fairly substantial effort to get it to work with a local electronic health/pathology record system). The system only works with Illumina data; this is a fairly substantial limitation, as is the fact that it cannot handle metagenomics. I would suggest some discussion of this. The use of Kraken for "contamination" is interesting. Given the vast degree of gene-sharing between e.g. the Enterobacterales, how do you know whether there is truly any contamination (or indeed mixed infection)? If a patient has an E. coli infection but Kraken says there are also a minority of Klebsiella reads present, what do you do? I think quite a lot of work is required to work out the relevant thresholds, and this needs great care and thought before it is deployed to a clinical setting. What QC parameters are used, and how have you derived and validated these? This is extremely important given the intended clinical use of your pipeline and is inadequately described in the paper. How do you cope with dumping a huge amount of data (much of which is likely of questionable clinical relevance) onto your electronic healthcare record system? I'm not convinced that giving clinicians abritAMR/PlasmidFinder/MLST results is useful. This still requires substantial expert interpretation from a clinician who has a solid understanding of bioinformatics (so that they are aware of the pitfalls of this data). Such individuals are extremely scarce at the moment. How do you decide what to do, for example, if you detect a resistance gene but your lab has called your isolate sensitive? The governance implications and the clinical judgment required are difficult. 
In my opinion a validation analysis on 4 isolates is inadequate, particularly given that you find discrepancies compared to published analysis in 25% of your dataset. Given that this is an automated pipeline, it should be straightforward to run e.g. thousands of isolates and check that your tool gives the expected results? I think in an ideal world and to gain proper validation one should include reference strains in this analysis. Can you describe in more detail the referrals to your sequencing service? How are these initiated? Who approves them? What is the median (and range) of time to result? What proportion yield a change in management? L199-200 - you need to describe your SNP/cgMLST analysis in more detail in the methods. How do you choose a reference genome? How do you decide what a relevant SNP threshold is? Also given that this is a manual process, it isn't actually part of what your pipeline does if I have understood correctly? Minor: L53 - "Long been viewed as the next step in clinical diagnostics". I suggest rewording - this is a bit colloquial and I would say that this is certainly not a universal view. I have several colleagues who are skeptical that it may ever have mainstream clinical applications (which I don't necessarily agree with). L63 - uploading sequencing data is not a patient privacy concern, it is only the associated metadata/patient identifiers that are a problem (so long as no human reads are sequenced obviously). The results section of your abstract is too vague and needs to be re-written so that it objectively describes the results of the study (one of which is the creation of the tool I guess).
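
The larger validation run suggested above amounts to comparing pipeline output against expected results for many isolates and reporting how often they agree. A minimal sketch in Python follows, assuming results can be reduced to one value per isolate (here a sequence type); the isolate IDs, sequence types, and function names are hypothetical and are not taken from the manuscript or the pipeline.

```python
def find_discrepancies(expected, observed):
    """Return sorted isolate IDs whose observed result differs from the expected one.

    Isolates missing from `observed` are counted as discrepancies.
    """
    return sorted(
        isolate
        for isolate, expected_result in expected.items()
        if observed.get(isolate) != expected_result
    )


def concordance(expected, observed):
    """Fraction of expected isolates for which the observed result matches."""
    if not expected:
        raise ValueError("expected results must not be empty")
    return 1 - len(find_discrepancies(expected, observed)) / len(expected)


# Hypothetical example data: four isolates, one of which disagrees.
expected = {
    "isolate_A": "ST131",
    "isolate_B": "ST258",
    "isolate_C": "ST5",
    "isolate_D": "ST398",
}
observed = {
    "isolate_A": "ST131",
    "isolate_B": "ST258",
    "isolate_C": "ST5",
    "isolate_D": "ST8",
}

print(find_discrepancies(expected, observed))
print(f"Concordance: {concordance(expected, observed):.0%}")
```

With four isolates and one mismatch, this mirrors the 25% discrepancy rate noted above; the same comparison scales unchanged to thousands of isolates or to a panel of reference strains.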

    Please rate the manuscript for methodological rigour

    Satisfactory

    Please rate the quality of the presentation and structure of the manuscript

    Poor

    To what extent are the conclusions supported by the data?

    Partially support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes

  5. Comments to Author

    Summary: The manuscript introduces the RSYD-BASIC bioinformatics pipeline, designed for the analysis of bacterial isolates in a clinical setting using whole genome sequences. The pipeline was designed to support clinicians with little or no bioinformatics background, enabling more robust and easier interpretation of clinically relevant results from bacterial isolate sequences. This is an important area of study and provides a valuable contribution to the wider community. The authors developed a pipeline that links widely used bioinformatics tools to analyse raw Illumina reads. Initially designed to support the Danish healthcare system, the pipeline requires Danish input and provides Danish output. To help non-Danish speakers use the pipeline, the authors kindly provided exemplar datasheets in English in the supplementary data. Unfortunately, I am of the opinion that, given the manuscript is in English and submitted to an international journal, the input and output for the pipeline should also be in English. I believe that a pipeline that currently accepts only Danish input and provides only Danish output poses a significant limitation for potential non-Danish-speaking users. For instance, users seeking pipelines or software supporting English input and output for convenience may explore alternative options. Enabling English input and output would support easier and faster result interpretation and, most importantly, help prevent potential errors. While the manuscript is concise and focused, certain sections could benefit from minor improvements for better clarity. Additionally, specific information is needed in some parts to ensure the reproducibility of results and allow others to verify the pipeline's suitability for their analyses (please refer to the more specific comments section). 
Given that alterations are required for the pipeline to accept English input and provide English output, along with corrections to the manuscript for result reproducibility, I am marking this as "major revisions".

    Specific comments:

    1. Certain parts of the manuscript need more in-depth explanation

    I am pleased with the manuscript being concise and focused on relevant information. From the manuscript it is clear what the pipeline is intended to do and how the wider community can benefit from it. However, there is a concern that non-bioinformaticians or scientists without a background in genomics may face challenges in understanding how the pipeline performs the analysis. This lack of understanding could potentially impact the interpretation of results and the assessment of their reliability by potential users.

    - Line 143-144 (…QC results are evaluated - this is partially automated, but some results may require manual examination…): a more in-depth explanation is needed regarding the specific aspects that are evaluated, distinguishing between automated and manual processes.
    - Line 125 (…species identification is performed using GTDB-Tk): as this pipeline is aimed at clinicians, it would be beneficial to provide a straightforward explanation of what GTDB-Tk does and the criteria it employs for assigning taxonomy. Since clinicians may not be familiar with all tools used in taxonomic classification, a simplified explanation will provide a better understanding.
    - Line 121 (…Most other analyses…): this needs to be more specific.

    These examples are indicative of areas within the manuscript where additional details are needed to facilitate a more thorough understanding of the pipeline and its operations. Please examine the manuscript for similar issues and address them accordingly.

    2. Reproducibility issues

    Reproducibility is crucial in bioinformatics as it helps identify errors in data processing, analysis, and interpretation. 
To ensure reproducibility, all relevant information must be provided so that the same results can be obtained from the raw data. I am pleased to see that the authors provided read accessions in the form of tables (Table 2 and Table 3) and listed genome assembly accessions throughout the text (line 248). This is generally good practice for reproducibility. However, I identified issues that might hinder reproducibility, and I strongly recommend adding the relevant information to maximise it. This includes:

    - Versions of all software used to carry out analyses (e.g. ChewBBACA, MLST tool, ExtractCgMLST and more…). In line 167, the authors stated "the current version of the MLST tool". This might not be sufficient and could be confusing, as the manuscript can be read at different times, and new versions might also be released before the manuscript is published, leaving readers with ambiguous interpretations.
    - Software parameters and settings. I am pleased to see that in some cases, for example line 259, the authors stated that default settings were used for ExtractCgMLST. However, this information is not provided for all software, for example for the MLST tool, which allows users to change parameters such as --minid, --mincov and --minscore, which can affect the output given.
    - All sample statistics that have been carried out (line 253).

    I would strongly encourage the authors to also provide all code used to carry out the analyses in the supplementary data. This might help other scientists with reproducibility and will ensure transparency, both of which are crucial for scientific integrity.

    3. Other comments

    I am pleased to see that the authors have cited all software used in the pipeline and manuscript. This acknowledges the efforts of the developers who created the software and gives them credit. Citing tools used in the analysis also adds transparency to your methodology. 
However, some databases were not cited, and I would like to request references for RefSeq (line 120) and NCBI (lines 166, 236). If a certain analysis was carried out and a certain output was generated as stated in the manuscript, the community would expect this to be provided in either the main text or the supplementary data. Unfortunately, I am unable to locate the minimum spanning tree mentioned in line 261. Similarly, as stated in lines 188-190, the pipeline was used to process sequencing data from 498 individual bacterial isolates. I understand that there might be ethical issues with sharing the data, but it would be beneficial to supply the data for at least the case mentioned in lines 191-209. This encourages transparency and allows potential users to analyse the data on their own.

    4. Code documentation

    I am pleased to see that the authors support open source software and provide a short guide on the usage of the pipeline. I believe that the community might benefit from some additional information being added. This includes:

    - Clearer instructions for the installation of the pipeline, listing all dependencies and libraries needed for the pipeline to work. For example, when following the current instructions and trying to run the "classic" mode, the following error was raised (similarly for the yaml module):

        Traceback (most recent call last):
          File "run_pipeline.py", line 21, in <module>
            import pandas as pd
        ModuleNotFoundError: No module named 'pandas'

      This might be an easy fix for some, but others might struggle.
    - A short tutorial with exemplar data. This way, potential users will be able to ensure that all dependencies were installed correctly and that the obtained results are consistent, independent of the operating system. This tutorial with exemplar data will also help new users, especially those unfamiliar with the tool, to quickly grasp its usage.

    Kind regards, Angelika Kiepas
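
The missing-module error reported under "Code documentation" could be caught before the pipeline starts by a small pre-flight check. The sketch below is illustrative only: the module list contains just the two modules named in the review (pandas and yaml), the pipeline's full dependency list is not known here, and the function name is hypothetical.

```python
import importlib.util

# Assumption: only the two modules named in the reviewer's error report.
# The pipeline's real dependency list is likely longer.
REQUIRED_MODULES = ("pandas", "yaml")


def missing_modules(modules=REQUIRED_MODULES):
    """Return the names of top-level modules that cannot be found for import."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_modules()
    if missing:
        print("Missing Python modules: " + ", ".join(missing))
        print("Install them before running the pipeline.")
    else:
        print("All required modules found.")
```

Note that a module name and its installable package name can differ; the yaml module, for instance, is provided by the PyYAML package, which is one more reason to list dependencies explicitly in the installation instructions.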

    Please rate the manuscript for methodological rigour

    Poor

    Please rate the quality of the presentation and structure of the manuscript

    Poor

    To what extent are the conclusions supported by the data?

    Partially support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes

  6. Dear Kat Steinke, Thank you for your efforts so far. I’m returning the manuscript with similar comments as before but with more detail on addressing some issues. In general, the manuscript is very short and to the point. Much of the information is obviously on the GitLab page, but the point of the publication is to do more than advertise this (as the page could technically be cited directly). Once the changes below are made, I believe the manuscript will be suitable for official review. Best wishes, John.

    Notes by section:

    Author names: Typically I would expect this to be a single line.

    Abstract: In general, these consist of between 200 and 250 words. I would encourage you to use this word limit. Abstracts form the basis of how many readers dive into a paper. As it stands, the lack of information within the abstract may indeed be an issue for readers. It should also be structured like a “mini-paper”, i.e. constructed with introductory material, perhaps some methods, results and your major conclusions/take-home message. While this is generalised, I would expect it to contain more information in your case. This may seem redundant, but the issue is compounded by the brevity of the manuscript itself.

    Introduction: I would include a reference/example where possible for line 54. Due to the brevity of the manuscript, I would either include examples of metadata sheets with descriptions directly or at minimum link specifically to the file (I see there is one in the GitLab repository). Additionally, as the manuscript is in English, so should the metadata sheet etc. be. I would request that all primary resources be in the language of the manuscript where possible. 
    Results: I would discourage the use of a screenshot for figure 1 and instead use formatted inline text similar to what you would find here in the installation section: https://github.com/rrwick/Deepbinner

        git clone https://github.com/rrwick/Deepbinner.git
        pip3 install ./Deepbinner
        deepbinner --help

    This is partly due to standard formatting approaches (which you have included in the GitLab repository) but also because screenshot resolution/sizing can be difficult to adjust for readers with additional visual needs. I’d expand on the figure legends where possible; two, for example, could include information on the colour scheme and why it’s important. Due to the nature of the paper, I would encourage expanding some of the description steps from line 94. The information included is enough for people who are to some extent experienced but not for those who are likely to be the primary users. “Read cleaning”, for example, isn’t clear unless you have some experience with library generation. This section highlights clearly the benefit of the pipeline, in general, the manuscript

    Discussion: Typically, this section is used to place the results/manuscript in context with the literature. Due to the lack of references in the section, it is relatively redundant. I encourage expanding on this section or forming a combined results and discussion section. In both cases, I would expect fairly substantial expansion. For example, for lines 174-184, do you have any rationale for why the differences exist? Is this due to versions of software, different software being used in the analysis, etc.?

  7. The work presented is clear and the arguments well formed. This study would be a valuable contribution to the existing literature. This is a study that would be of interest to the field and community. Dear Kat Steinke, Thank you for your submission. I've chosen major revisions before sending the article out to review primarily because of the quality of the manuscript, not the quality of the work, which from initial investigation looks good. The work is interesting and clearly applicable; the downside is that the manuscript is underdeveloped. Firstly, much of the standard formatting I would expect is not present. While this is not catastrophic, I would suggest viewing some published manuscripts as examples and also consulting the recommended formatting here: https://www.microbiologyresearch.org/prepare-an-article#6. While I very much encourage much of what exists in the manuscript, we need more. More detail: certainly I would encourage the inclusion of test data (which I think may have been included on the GitLab page), but also an analysis and discussion of the outcome. Fundamentally, this would show the validity of the pipeline and allow the authors to guide users more thoroughly. For examples like line 147, I would expect more information. Additionally, the project is fundamentally intended for use by relative non-experts. While I do not consider myself an expert bioinformatician by any means, I've used many of the tools in the manuscript and similar repositories (GitHub). I would encourage making the GitLab page as user-friendly as possible; in my experience, completely new users wouldn't know where to start. I would certainly encourage at least some of this to be in the manuscript, again with the non-expert user in mind. I would, however, commend the scripts and the comments within them, as they appear to be robust and helpful in understanding the process and what each stage is doing. 
Ultimately, I fear the manuscript in its current state would require several rounds of revisions, and I encourage the authors to develop the manuscript further before resubmission. Please feel free to contact us if you require further guidance. Best wishes, John. P.S. I'd be curious to know why this name specifically was selected over others; I always enjoy an interesting title!