MuDoGeR: Multi-Domain Genome Recovery from metagenomes made easy
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Several frameworks exist for recovering prokaryotic, eukaryotic, and viral genomes from metagenomes. For those with little bioinformatics experience, it is difficult to evaluate quality, annotate genes, dereplicate, assign taxonomy, and calculate relative abundance and coverage for genomes belonging to different domains. MuDoGeR is a user-friendly tool, accessible to non-bioinformaticians, that makes recovery of prokaryotic, eukaryotic, and viral genomes from metagenomes, alone or in combination, easy. By testing MuDoGeR using 574 metagenomes and 24 genomes, we demonstrated that users can run it on a few samples or at high throughput. MuDoGeR is open-source software available at https://github.com/mdsufz/MuDoGeR.
Article activity feed
-
It was designed to be an easy-to-use tool that outputs ready-to-use comprehensive files
I think that certain infrastructure improvements could make this tool more user-friendly and stable, and bring it in line with software engineering best practices, such as implementing tests and better versioning of the individual software packages that are included. These are a few problems and potential solutions I see:
This pipeline requires managing very large conda environments, which can get out of hand very quickly, in addition to potential difficulties with installation and solving environments. If the authors would like to stay with conda environments, a quick fix for slow environment solving and installation would be to use mamba to build these environments.
Since the pipeline is written as a series of bash/R/Python scripts depending on conda environments, it is somewhat fragile, and it is hard to ensure it works on most infrastructures, or even the intended infrastructure. Even if the actual installation process is made smoother, there is still the problem of verifying which versions of the tools in the pipeline were used. There is a way to export the conda environments and versions, but it's not a perfect solution. I think an involved pipeline like this would greatly benefit from being executed with a workflow manager such as Snakemake or Nextflow, my personal opinion being that it should be implemented in Nextflow. Although Snakemake is easier to learn and integrates conda environments more easily, it's difficult to ensure Snakemake pipelines will work on diverse platforms. Nextflow can also use conda environments, but the preference is for Docker or Singularity images, which solves some of the issues with keeping track of versions. Additionally, Nextflow has testing and CI capability built in, making it easier to ensure that future updates are still functional and work as expected. Finally, Nextflow has been tested on various platforms, from HPC schedulers and local environments to cloud providers.
Related to the issue above, I don't see how this pipeline can be run in a high-throughput way, because it isn't written as a DAG like the pipelines implemented in Snakemake/Nextflow. My understanding is that you would have to run all of the samples together in more of a "for loop" fashion, which doesn't take advantage of the HPC or cloud resources one might have. The only way somebody could use this in the cloud is on a single EC2 instance, which isn't very cost- or time-efficient. Making the pipeline truly high-throughput, so samples can be run in parallel for certain tasks and then aggregated together, requires DAG infrastructure.
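To illustrate the mamba suggestion above, here is a minimal sketch of swapping mamba in for conda's solver; the environment file name is a hypothetical example, not one shipped with MuDoGeR:

```shell
# Install mamba into the base environment, then use it as a drop-in
# replacement for conda's slower dependency solver.
conda install -n base -c conda-forge mamba

# Create one of the pipeline's environments from a pinned YAML file
# ("mudoger-env.yml" is a made-up filename for illustration).
mamba env create -f mudoger-env.yml

# Export the solved environment with exact versions, so users can
# verify and reproduce the tool versions that were actually installed.
mamba env export -n mudoger-env > mudoger-env.lock.yml
```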
-
on paired-end short-sequence reads generated by ILLUMINA machines, but future updates will include tools to work with data from long-read sequencing.
Related to my earlier comment, adding support for long reads will be much easier if the underlying infrastructure is a workflow manager such as Snakemake or Nextflow. Additionally, even though these tools have an initial learning curve, communities such as nf-core already offer many pre-made, community-sourced modules to drop into workflows (https://nf-co.re/modules), which would cut down on the time it takes to get new features added into the pipeline.
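As a concrete sketch of the nf-core workflow, the nf-core tools CLI can browse and install community modules into a Nextflow pipeline (these commands assume the nf-core/tools package is installed and are run from a pipeline's root directory):

```shell
# List the modules available in the community repository.
nf-core modules list remote

# Install a specific module (e.g. fastqc) into the current pipeline.
nf-core modules install fastqc
```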
-
We tested the MuDoGeR pipeline using 598 metagenome libraries
I was expecting more expanded results breaking down MAG lineage recovery by the biome each metagenome came from. Additionally, it would be good to explain why these metagenomes specifically were chosen: was it because they had a certain sequencing depth, or because they came from certain biomes of interest? It might be good to selectively choose metagenomes in which eukaryotes are expected to be in high abundance, such as certain fermented foods, for comparison with these other environments.
-
MuDoGeR v1.0 at a glance
One thing I am unclear about is how the pipeline or its modules handle a single sample failing during a run: will it halt the entire pipeline or module? For example, if the RAM calculation ends up being incorrect and the assembly program runs out of memory for a single sample, will this cause the pipeline to end? Is there some --resume functionality so you don't have to restart a pipeline from the beginning if there is a problem halfway through a module?
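For context on the resume question, one common shell-level pattern (a sketch of a generic approach, not something MuDoGeR necessarily implements) is a per-sample ".done" marker, so a rerun skips samples that already finished instead of restarting the whole module:

```shell
#!/usr/bin/env bash
# Checkpoint sketch: each sample writes a ".done" marker on success,
# and a rerun of the loop skips samples that already have one.
set -u

workdir=$(mktemp -d)
samples="sampleA sampleB sampleC"

run_step() {  # stand-in for a per-sample step (e.g. assembly)
    echo "assembled $1"
}

for s in $samples; do
    marker="$workdir/$s.done"
    if [ -f "$marker" ]; then
        echo "skipping $s (already done)"
        continue
    fi
    if run_step "$s" > "$workdir/$s.log"; then
        touch "$marker"          # checkpoint reached
    else
        echo "failed $s; a rerun will resume from here" >&2
    fi
done
```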
-
MuDoGeR was divided into five modules
I really appreciate that the pipeline was split into different modules, which encourages the user to manually check their data and outputs at various steps, and that you can run it from various points instead of rerunning the entire thing.
-
MuDoGeR is open-source software available
I appreciate the very extensive documentation and examples for running the pipeline. I think the documentation would be better structured as a docs site such as Read the Docs or MkDocs, since the README is so long and extensive. The page often freezes for me while scrolling, because it contains several graphics, and the long README has no table of contents to guide the user.
-
Biodiversity analysis with MuDoGeR
Is there a final dereplication and check of contigs between the different lineages, to make sure the same contig didn't end up in multiple bins of different lineages?
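As a quick sanity check a user could run themselves, something like the following would flag contig headers that appear in bins from more than one lineage (the bin filenames here are made up, and toy FASTA files are generated just to demonstrate the check):

```shell
#!/usr/bin/env bash
# Build two toy bin FASTA files that share one contig header, then
# report any header that appears in more than one bin file.
set -u
bindir=$(mktemp -d)

printf '>contig_1\nACGT\n>contig_2\nGGCC\n' > "$bindir/prok_bin1.fa"
printf '>contig_2\nGGCC\n>contig_3\nTTAA\n' > "$bindir/euk_bin1.fa"

# grep -H prefixes each header with its source file; after sorting by
# header, uniq -d prints headers that occur in more than one bin.
grep -H '^>' "$bindir"/*.fa \
    | sort -t: -k2 \
    | awk -F: '{print $2}' \
    | uniq -d
```

On this toy input the check prints the shared header (contig_2), i.e. a contig binned into two different lineages.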