MetaCerberus: distributed highly parallelized scalable HMM-based implementation for robust functional annotation across the tree of life

Jose L. Figueroa
Eliza Dhungel
Cory R. Brouwer
Richard Allen White

This article has been Reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

Summary

MetaCerberus is an exclusively HMM/HMMER-based tool that is massively parallel, uses little memory, and provides rapid, scalable annotation for functional gene inference from genomes to metacommunities. It provides robust enumeration of functional genes and pathways across many current public databases, including KEGG (KO), COGs, CAZy, FOAM, and virus-specific databases (i.e., VOGs and PHROGs). In a direct comparison, MetaCerberus was twice as fast as EggNOG-Mapper and produced better annotation of viruses, phages, and archaeal viruses than DRAM, PROKKA, or InterProScan. MetaCerberus annotates more KOs across domains than DRAM, with a 186x smaller database and a third less memory. MetaCerberus is fully integrated with differential statistical tools (i.e., DESeq2 and edgeR), pathway enrichment (GAGE R), and Pathview R for quantitative elucidation of metabolic pathways. MetaCerberus implements the key to unlocking the biosphere across the tree of life at scale.

Availability and implementation

MetaCerberus is written in Python 3 for both Linux and Mac OS X and is distributed under a BSD-3 license. The source code is freely available at https://github.com/raw-lab/metacerberus. MetaCerberus can also be installed with mamba: mamba create -n metacerberus -c bioconda -c conda-forge metacerberus
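Install commands copied from typeset PDFs or web pages sometimes carry an en dash (U+2013) where the shell expects an ASCII hyphen, in which case the option flags are no longer recognized. The following is a minimal sketch, not part of MetaCerberus, of a helper that restores the hyphens before a command is run. The function name `normalize_flags` and the constant `DASH_LOOKALIKES` are hypothetical and introduced here only for illustration.

```python
import shlex

# Characters that typesetting commonly substitutes for a leading "-":
# U+2013 (en dash) and U+2212 (minus sign). This list is an assumption,
# not an exhaustive inventory.
DASH_LOOKALIKES = ("\u2013", "\u2212")


def normalize_flags(command: str) -> list[str]:
    """Split a command line and replace a leading dash look-alike
    on each token with an ASCII hyphen. Only the first character of
    a token is touched, so values such as 'conda-forge' are unchanged."""
    fixed = []
    for token in shlex.split(command):
        if token.startswith(DASH_LOOKALIKES):
            token = "-" + token[1:]
        fixed.append(token)
    return fixed


# The install command as it might appear after typesetting.
typeset = (
    "mamba create \u2013n metacerberus "
    "\u2013c bioconda \u2013c conda-forge metacerberus"
)
print(shlex.join(normalize_flags(typeset)))
```

The returned token list can be passed directly to `subprocess.run` if the command should be executed rather than printed.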

Arcadia Science
Aug 21, 2023

We compared MetaCerberus to DRAM, InterProScan, and PROKKA for the time used per genome, RAM utilization, and disk space used across 100 randomly selected bacterial genomes within GTDB

Were more complex metagenomes tested for performance, for example with raw reads as input? It would be good to give users an expectation of the resources and time needed for raw reads from different communities, based on complexity and read depth.

Arcadia Science
Aug 21, 2023

map-based heatmaps

It would be great if one of the example heatmaps were shown here for demonstration, or in a longer tutorial in the wiki of the GitHub repo, for example. I'm not sure, though, whether this is output as part of the HTML dashboard?

Arcadia Science
Aug 21, 2023

A sample dashboard visualization

I'm assuming this output is an interactive HTML page, since the supplementary figure looks like a screenshot?

Arcadia Science
Aug 21, 2023

PacBio, fastp

What is the reason for trimming PacBio data? PacBio HiFi data is usually high-quality enough that this isn't necessary.

Arcadia Science
Aug 21, 2023

Porechop

Any plans to use Porechop_ABI (https://github.com/bonsai-team/Porechop_ABI) instead? Porechop is no longer actively maintained.

Arcadia Science
Aug 21, 2023

mamba create -n metacerberus -c bioconda -c conda-forge metacerberus

It's great to have this right at the beginning. I opened a GitHub issue, but wanted to point out that I got an error when trying to install with this command.

Arcadia Science
Aug 21, 2023

Databases for MetaCerberus

From scanning the documentation in the GitHub README, the note that the databases are hosted on OSF appears quite far down, and there are no instructions on whether the databases need to be placed in a specific folder or pointed to when running the command. Does this happen with the --setup command run after installing?

Arcadia Science
Aug 21, 2023

(pORFs)

By this point in the introduction there are already quite a few abbreviations that probably aren't needed, such as pORFs and the one for massively parallel sequencing. There are already a lot of abbreviations for software names, plus MAGs for example, so could some of the unnecessary ones be cut?

Arcadia Science
Aug 21, 2023

PROKKA

A small nit, but Prokka isn't written in all capitals.

Version published to 10.1101/2023.08.10.552700v1 on bioRxiv
Aug 12, 2023
