MetaCerberus: distributed highly parallelized scalable HMM-based implementation for robust functional annotation across the tree of life
This article has been reviewed by the following groups
Listed in
- Evaluated articles (Arcadia Science)
Abstract
Summary
MetaCerberus is an exclusively HMM/HMMER-based tool that is massively parallel, uses little memory, and provides rapid, scalable annotation for functional gene inference from genomes to metacommunities. It provides robust enumeration of functional genes and pathways across many current public databases, including KEGG (KO), COGs, CAZy, FOAM, and virus-specific databases (i.e., VOGs and PHROGs). In a direct comparison, MetaCerberus was twice as fast as EggNOG-Mapper and produced better annotation of viruses, phages, and archaeal viruses than DRAM, PROKKA, or InterProScan. MetaCerberus annotates more KOs across domains than DRAM, with a 186x smaller database and a third less memory. MetaCerberus is fully integrated with differential statistical tools (i.e., DESeq2 and edgeR), pathway enrichment (GAGE R), and Pathview R for quantitative elucidation of metabolic pathways. MetaCerberus implements the key to unlocking the biosphere across the tree of life at scale.
Availability and implementation
MetaCerberus is written in Python 3 for both Linux and Mac OS X and is distributed under a BSD-3 license. The source code is freely available at https://github.com/raw-lab/metacerberus . MetaCerberus can also be easily installed using mamba create -n metacerberus -c bioconda -c conda-forge metacerberus
Article activity feed
-
We compared MetaCerberus to DRAM, InterProScan, and PROKKA for the time used per genome, RAM utilization, and disk space used across 100 randomly selected bacterial genomes within GTDB
Was performance also tested on more complex metagenomes, for example with raw reads as input? It would be good to give users an expectation of the time and resources needed for raw reads from different communities, based on complexity and read depth.
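To make this kind of comparison easy to reproduce, here is a minimal sketch of how wall-clock time and peak memory could be recorded for any single command on Linux or macOS. The `benchmark` helper is hypothetical and is not part of MetaCerberus or the manuscript. The command in the example is a placeholder to be replaced with the annotation run being profiled.

```python
import resource
import subprocess
import sys
import time


def benchmark(cmd):
    """Run cmd once; return (wall-clock seconds, peak child RSS in MiB).

    Peak RSS is read from RUSAGE_CHILDREN, which reports the maximum over
    all child processes waited for so far, so start a fresh interpreter
    for each measurement that should be independent.
    """
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    elapsed = time.perf_counter() - start
    peak = resource.getrusage(resource.RUSAGE_CHILDREN).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 ** 2 if sys.platform == "darwin" else 1024
    return elapsed, peak / divisor


if __name__ == "__main__":
    # Placeholder command (an assumption); substitute the real annotation run.
    seconds, mib = benchmark([sys.executable, "-c", "print('done')"])
    print(f"{seconds:.2f} s, peak RSS {mib:.1f} MiB")
```

Reporting these two numbers for raw-read inputs of a few different depths would give users a rough sense of how requirements scale.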
-
map-based heatmaps
It would be great if one of the example heatmaps were shown here for demonstration, or in a longer tutorial in the wiki of the GitHub repo, for example. I'm not sure, though, whether this is output as part of the HTML dashboard?
-
A sample dashboard visualization
I'm assuming this output is an interactive HTML file, since the supplementary figure looks like a screenshot?
-
PacBio, fastp
What is the reason for trimming PacBio data? PacBio HiFi data is usually high-quality enough that this isn't necessary.
-
Porechop
Any plans to use Porechop_ABI (https://github.com/bonsai-team/Porechop_ABI) instead? Porechop is no longer actively maintained.
-
mamba create -n metacerberus -c bioconda -c conda-forge metacerberus
It's awesome to have this right at the beginning. I opened a GitHub issue, but wanted to point out that I got an error when trying to install with this command.
-
Databases for MetaCerberus
From scanning the README on GitHub, the fact that the databases are hosted on OSF only appears quite far down, and there are no instructions on whether the databases need to be placed in a specific folder or pointed to when running the command. Does this happen with the --setup command that is run after installing?
-
(pORFs)
By this point in the introduction there are already quite a few abbreviations, some of which probably aren't needed, such as those for pORFs and massively parallel sequencing. There are already a lot of abbreviations for software names and for terms like MAGs, so could the ones that aren't necessary be cut?
-
PROKKA
A small nit, but Prokka isn't written in all caps.
-