MuDoGeR: Multi-Domain Genome Recovery from metagenomes made easy
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Several frameworks exist for recovering prokaryotic, eukaryotic, and viral genomes from metagenomes. For those with little bioinformatics experience, it is difficult to evaluate quality, annotate genes, dereplicate, assign taxonomy, and calculate relative abundance and coverage for genomes belonging to different domains. MuDoGeR is a user-friendly tool, accessible to non-bioinformaticians, that makes recovery of prokaryotic, eukaryotic, and viral genomes from metagenomes, alone or in combination, easy. By testing MuDoGeR using 574 metagenomes and 24 genomes, we demonstrated that users can run it on a few samples or at high throughput. MuDoGeR is open-source software available at https://github.com/mdsufz/MuDoGeR.
Article activity feed
-
It was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files
I think that certain infrastructure improvements could make this tool more user-friendly and stable, and bring it in line with software engineering best practices, such as implementing tests and better versioning of the individual software packages that are included. These are a few problems and potential solutions I see:
This pipeline requires managing very large conda environments, which can get out of hand very quickly, in addition to potential difficulties with installation and solving environments. If the authors would like to stay with conda environments, a quick fix for slow environment solving and installation would be to use mamba to build these environments.
Since the pipeline is written as a series of bash/R/Python scripts depending on conda environments, it is somewhat fragile, and it is hard to ensure it works on most infrastructures, or even the intended infrastructure. Even if the actual installation process is made smoother, there is still the problem of verifying which versions of the tools in the pipeline were used. There is a way to export the conda environments and versions, but it's not a perfect solution. I think an involved pipeline like this would greatly benefit from being executed with a workflow manager such as Snakemake or Nextflow, my personal opinion being that it should be implemented in Nextflow. Although Snakemake is easier to learn and integrates conda environments more easily, it's difficult to ensure Snakemake pipelines will work on diverse platforms. Nextflow can also use conda environments, but the preference is for Docker or Singularity images, which solves some of the issues with keeping track of versions. Additionally, Nextflow has testing and CI capability built in, making it easier to ensure that future updates are still functional and work as expected. Finally, Nextflow has been tested on various platforms, from HPC schedulers and local environments to cloud providers.
Related to the issue above, I don't see how this pipeline can be run in a high-throughput way, because it isn't written as a DAG like the pipelines implemented in Snakemake/Nextflow. My understanding is that you would have to run all of the samples together in more of a "for loop" fashion, which doesn't take advantage of the HPC or cloud resources one might have. The only way somebody could use this in the cloud is on a single EC2 instance, which isn't very cost- or time-efficient. Making the pipeline truly high-throughput, so samples can be run in parallel for certain tasks and then aggregated together, requires DAG infrastructure.
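To illustrate the mamba suggestion above, here is a minimal sketch of swapping mamba in for conda's solver; the environment file name is a hypothetical example, not one shipped with MuDoGeR:

```shell
# Install mamba into the base environment, then use it as a drop-in
# replacement for conda's slower dependency solver.
conda install -n base -c conda-forge mamba

# Create one of the pipeline's environments from a pinned YAML file
# ("mudoger-env.yml" is a made-up filename for illustration).
mamba env create -f mudoger-env.yml

# Export the solved environment with exact versions, so users can
# verify and reproduce the tool versions that were actually installed.
mamba env export -n mudoger-env > mudoger-env.lock.yml
```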
-
on paired-end short-sequence reads generated by ILLUMINA machines, but future updates will include tools to work with data from long-read sequencing.
Related to my earlier comment, adding support for long reads will be much easier if the underlying infrastructure is a workflow manager such as Snakemake or Nextflow. Additionally, even though these tools have an initial learning curve, communities such as nf-core already offer many pre-made, community-sourced modules to drop into workflows (https://nf-co.re/modules), which would cut down on the time it takes to get new features added into the pipeline.
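As a concrete sketch of the nf-core workflow, the nf-core tools CLI can browse and install community modules into a Nextflow pipeline (these commands assume the nf-core/tools package is installed and are run from a pipeline's root directory):

```shell
# List the modules available in the community repository.
nf-core modules list remote

# Install a specific module (e.g. fastqc) into the current pipeline.
nf-core modules install fastqc
```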
-
We tested the MuDoGeR pipeline using 598 metagenome libraries
I was expecting more expanded results breaking down MAG lineage recovery by the biome each metagenome came from. Additionally, it would be good to explain why these metagenomes specifically were chosen: was it because they had a certain sequencing depth, or because they came from certain biomes of interest? It might be good to selectively choose metagenomes in which eukaryotes are expected to be in high abundance, such as certain fermented foods, for comparison with these other environments.
-
MuDoGeR v1.0 at a glance
One thing I am unclear about is how the pipeline or its modules handle a single sample failing during a run: will it halt the entire pipeline or module? For example, if the RAM calculation ends up being incorrect and the assembly program runs out of memory for a single sample, will this cause the pipeline to end? Is there some --resume functionality so you don't have to restart a pipeline from the beginning if there is a problem halfway through a module?
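For context on the resume question, one common shell-level pattern (a sketch of a generic approach, not something MuDoGeR necessarily implements) is a per-sample ".done" marker, so a rerun skips samples that already finished instead of restarting the whole module:

```shell
#!/usr/bin/env bash
# Checkpoint sketch: each sample writes a ".done" marker on success,
# and a rerun of the loop skips samples that already have one.
set -u

workdir=$(mktemp -d)
samples="sampleA sampleB sampleC"

run_step() {  # stand-in for a per-sample step (e.g. assembly)
    echo "assembled $1"
}

for s in $samples; do
    marker="$workdir/$s.done"
    if [ -f "$marker" ]; then
        echo "skipping $s (already done)"
        continue
    fi
    if run_step "$s" > "$workdir/$s.log"; then
        touch "$marker"          # checkpoint reached
    else
        echo "failed $s; a rerun will resume from here" >&2
    fi
done
```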
-
MuDoGeR was divided into five modules
I really appreciate that the pipeline was split into different modules, which encourages the user to manually check their data and outputs at various steps, and that you can run it from various points instead of rerunning the entire thing.
-
MuDoGeR is open-source software available
I appreciate the very extensive documentation and examples for running the pipeline. I think the documentation would be better structured as a docs site such as Read the Docs or MkDocs, since the README is so long and extensive. The page often freezes for me while scrolling, because it contains several graphics, and the long README has no table of contents to guide the user.
-
Biodiversity analysis with MuDoGeR
Is there a final dereplication and check of contigs between the different lineages, to make sure the same contig didn't end up in multiple bins of different lineages?
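As a quick sanity check a user could run themselves, something like the following would flag contig headers that appear in bins from more than one lineage (the bin filenames here are made up, and toy FASTA files are generated just to demonstrate the check):

```shell
#!/usr/bin/env bash
# Build two toy bin FASTA files that share one contig header, then
# report any header that appears in more than one bin file.
set -u
bindir=$(mktemp -d)

printf '>contig_1\nACGT\n>contig_2\nGGCC\n' > "$bindir/prok_bin1.fa"
printf '>contig_2\nGGCC\n>contig_3\nTTAA\n' > "$bindir/euk_bin1.fa"

# grep -H prefixes each header with its source file; after sorting by
# header, uniq -d prints headers that occur in more than one bin.
grep -H '^>' "$bindir"/*.fa \
    | sort -t: -k2 \
    | awk -F: '{print $2}' \
    | uniq -d
```

On this toy input the check prints the shared header (contig_2), i.e. a contig binned into two different lineages.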