SNiPgenie: A tool for microbial SNP site detection from whole genome sequencing data

Abstract

Whole-genome sequencing (WGS) of microbial pathogens provides a high-resolution approach to antibiotic resistance profiling, lineage classification, and outbreak surveillance. Identification of single nucleotide polymorphisms (SNPs) across the genome by alignment against a reference genome is the highest-precision method of delineating strains. SNiPgenie is a bioinformatics pipeline designed to perform the entire variant calling process across many samples simultaneously. It was developed as part of work on WGS tools to support tracking the transmission of Mycobacterium bovis, the principal causative agent of TB in livestock and wildlife in Ireland. SNiPgenie may, however, be applied to other bacteria where evolutionary change can be tracked accurately using SNPs. The tool comes with both a command-line interface and a user-friendly graphical interface. It can run on standard desktop or laptop computers. SNiPgenie and its documentation are available at https://github.com/dmnfarrell/snipgenie.

Article activity feed

  1. Thank you very much for submitting your manuscript to Access Microbiology. It has now been reviewed by two experts in the field, whose comments are attached at the bottom of this email. In general, they agree that the proposed tool (SNiPgenie) has potential and can be a valuable resource. However, they have identified a number of shortcomings in the manuscript that make them question the applicability of this tool. They have very kindly provided an extensive set of recommendations to improve the methodological rigor of this work, including further validations and comparative analyses, and to enhance the readability and understanding of the manuscript. We would be interested in considering a revised manuscript that thoroughly addresses all of their concerns and suggestions. Please provide the revised manuscript (including a tracked-changes document), along with a point-by-point response to the reviewer comments, within 3 months.

  2. Comments to Author

    Summary

    Whole-genome sequencing (WGS) is increasingly used in clinical and public health microbiology, leading to the development of many bioinformatics pipelines. Single-nucleotide polymorphism (SNP) analysis provides high-resolution insights into strain characteristics and relatedness. While various SNP analysis methods exist, there is no established gold-standard benchmarking approach, and pipeline selection is often based on individual preference. The establishment of an agreed-upon gold-standard benchmarking process for microbial variant analysis is becoming increasingly important to aid in its robust application, improve transparency of pipeline performance under different settings and direct future improvements and development.

    This work describes SNiPgenie, a bioinformatics pipeline for variant calling and phylogenetic analysis from WGS data. The software appears to be functional and openly available. Given that Access Microbiology prioritises methodological rigor over novelty, the manuscript should still provide sufficient validation, practical applications, or methodological insights beyond the tool's documentation (https://github.com/dmnfarrell/snipgenie) and what is already publicly available (10.3389/fvets.2021.780018). In its current form, the manuscript largely summarises existing information rather than presenting a detailed evaluation of SNiPgenie's performance and utility.

    Major Comments

    The manuscript provides an overview of SNiPgenie's functionality but lacks new scientific insights and comparative analysis. The authors mention that the tool has been "previously benchmarked" (10.3389/fvets.2021.780018), but the manuscript does not include any specific benchmarking results beyond a brief statement that "performance was found to closely match other tools when measured on synthetic genomes generated from a real-life phylogeny".

    To strengthen the paper's methodological rigor, the following aspects should be addressed:

    1) Include a summary of key findings from the benchmarking study to contextualise SNiPgenie's performance.
    2) Discuss how SNiPgenie differs from or improves upon existing and well-established SNP calling pipelines (e.g., BactSNP, Clair3, Snippy, Lyveset2, or SPANDx) beyond its dual command-line and GUI interfaces.
    3) Demonstrate the tool's effectiveness with case studies or real-world datasets, which would improve its applicability to microbiological research.

    Methodological rigour, reproducibility and availability of underlying data

    The manuscript would benefit from additional details regarding:

    1) Computational performance: Information on runtime, memory usage, and scalability across different datasets would help potential users assess feasibility. Instead of a general recommendation like "A computer with at least 4-8 processor threads and 8GB of RAM is recommended", consider including performance benchmarks under various computational setups to give a clearer picture of resource requirements.
    2) Accuracy and validation: While previous benchmarking is cited, including a summary or additional real-world validation would strengthen the manuscript.
    3) Limitations and edge cases: A discussion of conditions under which SNiPgenie may not perform optimally would provide transparency and assist users in making informed decisions.

    Providing performance benchmarks and a comparative analysis against commonly used tools would enhance the manuscript's scientific value. See: 10.1101/2022.05.05.487569

    Presentation of results

    The manuscript dedicates considerable space to software usage instructions, which could be streamlined or referenced in the documentation. For example, a significant portion (approximately lines 82-159) of the manuscript focuses on detailed usage instructions such as:

    * Input file formatting requirements
    * Command-line options and their parameters
    * Python API usage with code examples (reads more like user documentation than scientific content)
    * GUI functionality descriptions
    * Plugin details

    Instead, a stronger focus on methodology, validation, and comparative performance would enhance scientific merit. Specific areas for improvement include:

    Introduction: Clearly articulate the gaps in existing SNP detection tools and how SNiPgenie addresses these gaps.
    Methods and Results: Provide a structured comparison of SNiPgenie's workflow with other tools, along with performance metrics.
    Discussion: Expand on the tool's broader implications, potential applications, and areas for future development.

    Specific Comments

    Lines 57-66: The manuscript states SNiPgenie was "primarily written to be used with bacterial isolates of M. bovis" but later describes it as a general-purpose tool. Clarification on its applicability to other organisms would be helpful.

    Lines 60-67: The benchmarking discussion is minimal. If this is a core strength of the tool, some summary of the benchmarking results should be included directly in this manuscript rather than simply referencing another paper.

    Lines 68-80: The technical description lacks specifics on implementation details that would be relevant to bioinformaticians. For example: "Alignment is done using bwa [11] though other aligners may be used. At the core of all the variant calling steps are samtools and bcftools". What parameters are used for the variant calling tools? Which versions of the tools? What SNP filtering strategies are applied? Are there any novel approaches implemented?

    Lines 91-97: The outputs section would benefit from additional explanation of file contents and their relevance to users.

    Lines 100-106: The variant calling method description is extremely brief and lacks technical details on parameters, default settings, or rationale for methodological choices.

    Lines 163-168: The conclusion makes claims about versatility and user-friendliness without providing evidence to support these claims through user testing, performance metrics, or comparative analysis.
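
The performance metrics requested in this report could, for instance, be summarised as precision, recall and F1 of called SNP sites against a known truth set, such as simulated genomes with known mutations. The following is a minimal sketch under stated assumptions: SNP sites are represented as (chromosome, position) tuples, and the function name `snp_metrics` and the example coordinates are hypothetical. It is not part of SNiPgenie or of the cited benchmarking study.

```python
def snp_metrics(called, truth):
    """Compare called SNP sites against a truth set.

    Both arguments are iterables of (chromosome, position) tuples.
    Returns counts of true/false positives and false negatives,
    together with precision, recall and F1.
    """
    called, truth = set(called), set(truth)
    tp = len(called & truth)   # sites called and present in the truth set
    fp = len(called - truth)   # sites called but absent from the truth set
    fn = len(truth - called)   # true sites that were missed
    precision = tp / (tp + fp) if called else 0.0
    recall = tp / (tp + fn) if truth else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"tp": tp, "fp": fp, "fn": fn,
            "precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: four true sites, three recovered, one spurious call.
truth_sites = [("chr", 100), ("chr", 200), ("chr", 300), ("chr", 400)]
called_sites = [("chr", 100), ("chr", 200), ("chr", 300), ("chr", 999)]
metrics = snp_metrics(called_sites, truth_sites)
```

Reporting these three numbers per pipeline and per dataset would allow a direct comparison between tools on the same inputs.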

    Please rate the manuscript for methodological rigour

    Poor

    Please rate the quality of the presentation and structure of the manuscript

    Satisfactory

    To what extent are the conclusions supported by the data?

    Not at all

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes

  3. Comments to Author

    The authors have presented open access documentation for SNiPgenie on GitHub. The source code is also accessible on Zenodo. The SNiPgenie GitHub page has documentation of use on the wiki, describes the flags for the tool, and includes installation instructions. It notes that the software can be installed for use on Linux, including through Windows Subsystem for Linux. It can be run through a Command Line Interface (CLI) or through a Graphical User Interface (GUI). The authors also provide example usage with code on the GitHub page and descriptions of the output files. There are four inbuilt reference genomes as part of the software, which are all Mycobacteria. Users have the option to use their own reference genomes instead, by supplying a flag.

    Overall, I believe that when the authors address the comments below and improve the manuscript, it will be suitable for publication in Access Microbiology. SNiPgenie will provide a good additional bioinformatics tool for those interested in SNP analysis of bacterial genomes, particularly in MTBC. A strength of the pipeline is the addition of a GUI, making it more accessible to those who prefer a visual representation. Additionally, the integrated flag to create a RAxML phylogenetic tree is very useful, particularly for those who are not as familiar with drawing phylogenetic trees.

    * The authors mention that SNiPgenie was created for use on MTBC but can be used for other bacterial genomes where evolutionary change can be observed through SNPs. It would be good to see an example of its use in another bacterial species, and how the software compares to MTBC in terms of effectiveness and accuracy.
    * Can the authors address how they deal with any version changes of external tools that they use for SNiPgenie and what the effects may be on their software? Also, could the authors please add version numbers of the other tools they are using, such as bwa (line 71), bcftools, samtools (line 72), bowtie, subread and minimap2 (line 112), for the current version of SNiPgenie?
    * Lines 59-61: SNiPgenie is compared to other SNP calling tools and is mentioned to match them in performance. Could the authors A) add a reference or example dataset to the part where it states the performance was matched, and B) expand a little more on this fact? If the performance was similar, can the authors highlight any other advantages or factors of its use over the other tools?
    * It would be nice to see a figure illustrating the workflow of SNiPgenie to enhance the 'methods and results' section of the manuscript. This will also help with ease of use and encourage new users through a better understanding of how the software works.
    * Line 75: It is not clear whether the output files will be left partially completed and new ones created, or whether the files will be continued when the run is resumed. Could this be made clearer?
    * Line 76: It states that the files can be overwritten if requested. Could the authors add how this can be done?
    * Line 76: It states that 'other options provided at the command line are detailed below'. The flow of this section is a little unclear; could it be clarified?
    * Line 78: This feels a little contradictory. It states that trimming of sequences may not be of great benefit to a SNP-calling pipeline. However, it also states that users should trim their fastq files before input, if applicable. Could the authors expand on what is meant by 'if applicable', or whether trimming does not need to occur in any instance? It would be interesting to see results where the authors have used both untrimmed and trimmed reads as input for their pipeline, and whether there was any difference between these.
    * Lines 79-80: Could the authors clearly resolve the difference between "minimum computer requirements" later in the section and the "relatively low system requirements" mentioned in the same section? The authors also state "more threads are better"; could a brief performance analysis be performed showing this?
    * Line 101: The authors mention that there is a "typical method" of calling variants for each file individually and then merging them into one file. Could this "typical method" be explained or referenced?
    * Lines 103-106: The authors state that there are differences in their methods from the benchmarking paper to this new SNiPgenie pipeline, but the results are "virtually identical". Could the authors please A) mention what the new methods are and link this to the "standardised approach", and B) include data and/or figures to show a comparison of the two results?
    * Line 110: It would be good to put the default string on a new line, so that the code is separated clearly from the rest of the text.
    * Line 112: The authors mention four different alignment tools that can be used with the pipeline. Do these alignment tools produce different results? It would be good to see an example comparing the outputs from SNiPgenie using the different alignment choices.
    * Line 113: Oxford Nanopore should be capitalised.
    * Line 115: There is an option for masking SNP sites from the output when using SNiPgenie. Can the authors please give example results comparing masked and unmasked SNP sites using the pipeline?
    * Line 118: It is mentioned that SNP sites for M. bovis will automatically be masked if using the "--species" flag and choosing this reference genome. Can the authors provide an example command showing this?
    * Line 144: Can the authors please expand on what the quality checks are? It is mentioned that they are a more basic version of FastQC.
    * Line 146: I found this a little hard to understand. Could the authors add an example figure of the GUI with the table data and plots?
    * Line 148: "with the any" does not need "the" in the sentence.
    * Line 149: The mention of a RAxML phylogenetic tree feels out of place here, as it has not been mentioned previously in the manuscript. It would be good for the authors to mention earlier in the paper that a RAxML phylogeny can be drawn as part of the pipeline, as this is a useful addition. It is mentioned on the GitHub page that a RAxML phylogeny can be drawn with an additional flag and software.
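
The masked-versus-unmasked comparison requested for line 115 amounts to removing SNP sites that fall inside excluded regions and reporting which sites remain. The following is a minimal sketch under stated assumptions: mask regions are BED-style half-open intervals (0-based start, exclusive end) on a single chromosome, and SNP positions are 1-based as in VCF. The function name `mask_sites` and the coordinates are hypothetical and do not reflect SNiPgenie's own implementation.

```python
def mask_sites(positions, regions):
    """Split 1-based SNP positions into kept and masked lists.

    regions is an iterable of (start, end) pairs in BED convention:
    0-based start, exclusive end. A 1-based position p lies inside a
    region when start <= p - 1 < end.
    """
    kept, masked = [], []
    for pos in positions:
        if any(start <= pos - 1 < end for start, end in regions):
            masked.append(pos)
        else:
            kept.append(pos)
    return kept, masked


# Hypothetical example: one masked region covering 1-based positions 101-200.
snp_positions = [100, 101, 200, 201]
mask_regions = [(100, 200)]
kept, masked = mask_sites(snp_positions, mask_regions)
```

Running the downstream analysis once on all positions and once on `kept` alone would show how much the masked sites influence the resulting SNP distances.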

    Please rate the manuscript for methodological rigour

    Poor

    Please rate the quality of the presentation and structure of the manuscript

    Satisfactory

    To what extent are the conclusions supported by the data?

    Partially support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes