Parapipe: A Pipeline for Parasite Next-Generation Sequencing Data Analysis Applied to Cryptosporidium

Abstract

Cryptosporidium, a protozoan parasite of significant public health concern, is responsible for severe diarrheal disease, particularly in immunocompromised individuals and young children in resource-limited settings. Analysis of whole genome next generation sequencing (NGS) data is critical in improving our understanding of Cryptosporidium epidemiology, transmission, and diversity. However, effective analysis of NGS data in a public health context necessitates the development of robust, validated computational tools. We present Parapipe, an ISO-accreditable bioinformatic pipeline for high-throughput analysis of NGS data from Cryptosporidium and related taxa. Built using Nextflow DSL2 and containerised with Singularity, Parapipe is modular, portable, scalable, and designed for use by public health laboratories. Using both simulated and real Cryptosporidium datasets, we demonstrate the power of Parapipe’s genomic analysis for generating epidemiological insights. We highlight how whole genome analysis yields substantially greater phylogenetic resolution than conventional gp60 molecular typing in C. parvum. Uniquely, Parapipe facilitates the integration of mixed infection analysis and phylogenomic clustering with epidemiological metadata, representing a powerful tool in the investigation of complex transmission pathways and identification of outbreak sources. Parapipe significantly advances genomic surveillance of Cryptosporidium, offering a streamlined, reproducible analytical framework. By automating a complex workflow and delivering detailed genomic characterisation, Parapipe provides a valuable tool for public health agencies and researchers, supporting efforts to mitigate the global burden of cryptosporidiosis.

Article activity feed

  1. Dear Arthur, Thank you for your patience. I've now secured two reviews. I've selected major revision based on the comments; however, this decision rests on one component of the review, and how you address it is at your discretion. Please address the concerns where possible, specifically point 1 of reviewer 1's comments regarding validation using other lineages. Adding additional validation will certainly strengthen the paper/tool, but it is not a requirement. That said, being explicitly clear in the text about the limitations of the study, the validation steps, and the tools used is a requirement. Addressing these observations will be helpful for readers who are less familiar with the individual pipeline tools and data types. Best wishes, John.

  2. Comments to Author

    The article by Morris et al. describes the development of a new workflow/pipeline for the analysis of NGS data from the apicomplexan parasite Cryptosporidium spp. Although originally designed for use in Cryptosporidium, the pipeline has clear utility for the analysis of NGS data from other related species and, more broadly, from other parasites for which this type of resource is desperately needed. The pipeline (Parapipe) clearly meets a defined need in the field of NGS data analysis, and I have no hesitation in recommending the paper for publication with a few minor changes, as suggested below:

    1. Line 99: I appreciate that it is pedantic, but please ensure that the full italicised species name is used at the start of a new sentence rather than the abbreviated name. The same applies on Line 300.
    2. Line 103: Please confirm Reference 8 is the correct one.
    3. Line 148: It is not clear from the figures provided, but does the pipeline provide a visual representation of the QC stages, i.e. read quality, mapping etc.?
    4. Line 156: The default setting is a threshold of 1 million reads in a sample. Is the pipeline also robust for lower-yield samples, for example, those derived from dual RNAseq experiments, which may contain smaller numbers of parasite-derived reads, with the bulk of sequencing capacity taken up by host material?
    5. Can the authors also clarify whether advice is provided to the user on the minimum quality of sequencing data expected for the pipeline?
    6. Is there a minimum computer specification required to run the pipeline on a locally installed instance?
    7. Line 203: Why was the Iowa II-ATCC reference genome selected?
    8. Is the pipeline also usable with long-read sequencing data, for example, PacBio or Oxford Nanopore datasets, and can it deal with the higher error rates associated with this type of NGS data?
    9. Line 251: This may be a typo, but can the authors confirm that the test set with 32,838,480 reads took 9.2 CPU hours to process, whilst the smaller test set with 1,216,240 reads took 13 CPU hours?
    10. Line 319: The Hamming distance analysis is not mentioned in the Methods section. Can the authors please add this in the appropriate section?
    11. Line 382: Please break this very long sentence up.

    Please rate the manuscript for methodological rigour

    Very good

    Please rate the quality of the presentation and structure of the manuscript

    Very good

    To what extent are the conclusions supported by the data?

    Strongly support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    Yes

  3. Comments to Author

    The manuscript by Morris et al. presents Parapipe, a bioinformatics pipeline designed for the analysis of next-generation sequencing (NGS) data from protozoan parasites. This workflow integrates widely used NGS tools (e.g., Bowtie2, fastp, MultiQC, and Trim Galore) into a two-module pipeline implemented in Nextflow. The authors demonstrate its utility using both simulated and real datasets from the apicomplexan parasite Cryptosporidium parvum, a known causative agent of cryptosporidiosis in mammals. I find that pipelines like Parapipe significantly enhance the accessibility of NGS data analysis for researchers with little or no training in bioinformatics. The results are presented clearly. However, I have several concerns regarding the methodological rigor that I believe should be addressed before the manuscript can be considered for publication.

    1. My primary concern lies in the authors' claim that the pipeline is suitable for analyzing NGS data from a broad range of parasitic protists beyond Cryptosporidium parvum. However, the pipeline was only tested on Cryptosporidium, and no validation was performed on protists from other eukaryotic lineages. I am concerned that the pipeline may yield inaccurate or suboptimal results when applied to organisms with substantially different genomic characteristics (e.g., different ploidy levels or high heterozygosity) without appropriate parameter adjustments. I recommend that the authors either explicitly state that the pipeline is currently tailored for Cryptosporidium or for organisms with similar genome features (and clearly define those features), or alternatively expand their validation to include representative species from eukaryotes outside Apicomplexa. Moreover, the authors should explicitly state that the pipeline is only suitable for the analysis of Illumina data.
    2. Some aspects of the pipeline validation using Cryptosporidium data lack sufficient clarity. For instance, while I agree with the authors' assertion that a phylogenetic tree based on whole-genome SNP data is likely more robust than one based on a single gene, the manuscript does not provide adequate evidence for this. It is currently unclear which of the two tree topologies (Fig. 3A or 3B) is more likely. I recommend that the authors include branch support values on both trees and consider applying additional phylogenetic inference methods to strengthen their conclusions. Additionally, the meaning of the scale bar in Fig. 3A should be added to the figure legend.
    3. I am somewhat puzzled by the inclusion of both fastp and Trim Galore in the pipeline, as these tools offer overlapping functionality, specifically adapter trimming and quality filtering of sequencing reads. The rationale for using both tools, rather than selecting one, should be clearly explained in the manuscript to clarify whether each serves a distinct purpose within the workflow.
    4. Fig. 3D and related analyses: Please provide the respective p-value for the correlation analysis.
    5. Please make sure that species and genus names are in italics throughout the manuscript.
    6. Line 79: Please change to something similar to "Cryptosporidium is a genus of single-celled parasitic organisms belonging to the eukaryotic group Apicomplexa, and its representatives are known to cause diarrheal disease in mammals".
    7. Line 103: Please add the respective reference.

    Please rate the manuscript for methodological rigour

    Satisfactory

    Please rate the quality of the presentation and structure of the manuscript

    Good

    To what extent are the conclusions supported by the data?

    Partially support

    Do you have any concerns of possible image manipulation, plagiarism or any other unethical practices?

    No

    Is there a potential financial or other conflict of interest between yourself and the author(s)?

    No

    If this manuscript involves human and/or animal work, have the subjects been treated in an ethical manner and the authors complied with the appropriate guidelines?

    No: No human and/or animal work reported