MTBseq-nf: Enabling Scalable Tuberculosis Genomics “Big Data” Analysis through a User-Friendly Nextflow Wrapper for MTBseq pipeline


Abstract

The MTBseq pipeline, published in 2018, was designed to address bioinformatics challenges in tuberculosis research using whole-genome sequencing (WGS) data. It was the first pipeline publicly available on GitHub to perform a full analysis of WGS data for M. tuberculosis, encompassing quality control through mapping, variant calling for lineage classification, drug resistance prediction, and phylogenetic inference. However, the pipeline’s architecture is not optimal for high-performance computing or cloud computing environments, which are often needed to process large datasets. To optimize the pipeline, we created a Nextflow wrapper, MTBseq-nf, which offers shorter execution times through its parallel mode along with several other improvements. In contrast to the linear batched analysis of samples in the TBfull step of the MTBseq pipeline, the MTBseq-nf wrapper can execute multiple instances of the same step in parallel and therefore makes full use of the available computational resources. To evaluate scalability and reproducibility, we benchmarked 90 M. tuberculosis genomes (ENA accession PRJEB7727) on a dedicated computing server. In our experiments, the MTBseq-nf parallel analysis mode was at least twice as fast as the standard MTBseq pipeline for more than 20 samples. Furthermore, the MTBseq-nf wrapper facilitates reproducibility by using the nf-core, bioconda, and biocontainers projects for platform independence. MTBseq-nf is a user-friendly wrapper optimized for hardware efficiency, scalability to larger datasets, and improved reproducibility.
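
The difference between linear batched processing and per-sample parallel execution described above can be illustrated with a minimal Python sketch. This is not MTBseq or MTBseq-nf code: `process_sample`, `run_batched`, and `run_parallel` are hypothetical names, and the per-sample step is simulated with a short sleep rather than a real mapping or variant-calling command.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def process_sample(sample_id: str) -> str:
    """Hypothetical stand-in for one per-sample analysis step."""
    time.sleep(0.01)  # simulate work; a real step would invoke an external tool
    return f"{sample_id}.done"


def run_batched(samples):
    """Process samples one after another, as in a linear batch."""
    return [process_sample(s) for s in samples]


def run_parallel(samples, workers=4):
    """Run the same step on several samples concurrently.

    Executor.map() returns results in input order, so the output
    matches run_batched() regardless of completion order.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(process_sample, samples))


if __name__ == "__main__":
    samples = [f"sample_{i:02d}" for i in range(8)]
    # Both strategies yield the same results; only wall-clock time differs.
    assert run_batched(samples) == run_parallel(samples)
```

In this toy setting the parallel variant finishes sooner because independent samples do not wait for one another; how much a real workflow gains depends on the number of samples, the available cores, and which steps are truly independent.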
