Fast, lightweight, and accurate metagenomic functional profiling using FracMinHash sketches

Mahmudur Rahman Hera
Shaopeng Liu
Wei Wei
Judith S. Rodriguez
Chunyu Ma
David Koslicki


Abstract

Motivation: Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. In general, k-mer-based sketching techniques have been successfully used in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe leveraging FracMinHash (implemented in sourmash, a publicly available software), a k-mer-sketching algorithm, to obtain functional profiles of metagenome samples.

Results: We show how pieces of the sourmash software (and the resulting FracMinHash sketches) can be put together in a pipeline to functionally profile a metagenomic sample. We named our pipeline fmh-funprofiler. We report that the functional profiles obtained using this pipeline demonstrate comparable completeness and better purity compared to the profiles obtained using other alignment-based methods when applied to simulated metagenomic data. We also report that fmh-funprofiler is 39-99x faster in wall-clock time, and consumes up to 40-55x less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets.

Reproducibility: This fast and lightweight metagenomic functional profiler is freely available and can be accessed here: https://github.com/KoslickiLab/fmh-funprofiler. All scripts of the analyses we present in this manuscript can be found on GitHub: https://github.com/KoslickiLab/KEGG_sketching_annotation_reproducibles

Version published to 10.1101/2023.11.06.565843v3 on bioRxiv
Jul 20, 2024
Version published to 10.1101/2023.11.06.565843v2 on bioRxiv
Apr 5, 2024
Arcadia Science
Nov 7, 2023

pathways

how did you deal with shared KOs between pathways when doing pathway level analysis?

Arcadia Science
Nov 7, 2023

After all these filtered steps, we have 1747 high-quality data remaining for the downstream analysis, including 547 healthy samples, 274 type 2 diabetes samples, and 926 samples related to inflammatory bowel disease.

many of these sequences contain detectable human sequences. I would be curious for you to run the human genome against your databases and see what functional profile is returned. it would let users know whether they need to do host filtering before applying this approach. if they didn't need to take that step (or any other QC), that would be huge time savings

Arcadia Science
Nov 7, 2023

-p protein, k=7, k=11, k=15, abund, scaled=1000

how did you come up with parameters, and how do you know they are the best to use? how would you advise users to choose between k-mer sizes for their own applications?
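For readers weighing these parameters, a minimal pure-Python sketch of FracMinHash over protein k-mers may help. It is not sourmash's implementation: the hash (a truncated SHA-256), the helper names `kmer_hash` and `fracminhash`, and the toy sequence are assumptions made for illustration. Here `k` is the amino-acid k-mer length, `scaled` keeps roughly 1 in `scaled` distinct k-mers, and the counts play the role of `abund`.

```python
import hashlib
from collections import Counter

HASH_SPACE = 2**64  # size of the (assumed) 64-bit hash space


def kmer_hash(kmer):
    """Map a k-mer to a 64-bit integer (illustrative hash, not sourmash's)."""
    digest = hashlib.sha256(kmer.encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def fracminhash(seq, k, scaled):
    """Keep every k-mer whose hash falls below HASH_SPACE / scaled.

    Returns a Counter mapping retained hash -> abundance in `seq`.
    """
    threshold = HASH_SPACE // scaled
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


protein = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # toy sequence, hypothetical
for k in (7, 11, 15):
    # scaled=1 keeps every k-mer; larger values subsample the hash space.
    print(k, len(fracminhash(protein, k, scaled=1)))
```

On a sequence this short, `scaled=1000` would usually retain nothing; the subsampling only pays off on inputs with many thousands of k-mers.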

Arcadia Science
Nov 7, 2023

We used BBMap [9] to simulate a metagenome from 1000 randomly selected genomes from all 4498 bacterial genomes present in the KEGG database.

reiterating the point from above, how does your approach break down with increasing evolutionary divergence from the reference, and how is that different from other tools. Soil might be a good ecosystem to test drive this in, and I think the CAMISIM tool allows you to introduce mutations from a reference in a known ratio/identity etc
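One way to frame the divergence question: under a deliberately simple model in which each residue is substituted independently and stays unchanged with probability equal to the sequence identity, a k-mer survives intact with probability identity^k. The sketch below compares that formula with a small simulation. The model, the function names, and the random toy sequence are assumptions for illustration only; it ignores indels and conservative substitutions and says nothing about how any specific tool behaves.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def expected_kmer_survival(identity, k):
    """Probability that all k residues of a k-mer are unchanged."""
    return identity ** k


def mutate(seq, identity, rng):
    """Substitute each residue independently with probability 1 - identity."""
    out = []
    for aa in seq:
        if rng.random() < identity:
            out.append(aa)
        else:
            out.append(rng.choice([x for x in AMINO_ACIDS if x != aa]))
    return "".join(out)


def observed_kmer_survival(seq, mutated, k):
    """Fraction of k-mer windows that are identical in both sequences."""
    windows = len(seq) - k + 1
    same = sum(seq[i:i + k] == mutated[i:i + k] for i in range(windows))
    return same / windows


rng = random.Random(0)
reference = "".join(rng.choice(AMINO_ACIDS) for _ in range(20000))
diverged = mutate(reference, 0.90, rng)
for k in (7, 11, 15):
    print(k,
          round(expected_kmer_survival(0.90, k), 3),
          round(observed_kmer_survival(reference, diverged, k), 3))
```

Under this model, longer k-mers lose shared content faster as identity drops, which is the trade-off behind the k-mer size question raised earlier.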

Arcadia Science
Nov 7, 2023

The number of KOs (a total of only 25K) is much smaller than the number of genes, and the number of k-mers in a KO is much larger than that of a single gene. Considering these factors, we designed our pipeline to invoke sourmash gather with a list of all KOs in the KEGG database, and then to output a list of KOs that ‘cover’ all observed k-mers in a given metagenome.
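The "cover" idea in this excerpt can be illustrated with a greedy minimum-set-cover loop over hash sets. This is a simplified sketch, not sourmash gather itself: the function name `greedy_gather`, the `min_overlap` cutoff, and the KO labels are hypothetical, and the real program applies additional thresholds and reporting.

```python
def greedy_gather(query, references, min_overlap=1):
    """Greedily pick references that explain the most remaining query hashes.

    query: set of hashes from the metagenome sketch.
    references: dict mapping a reference name to its set of hashes.
    Returns (list of (name, hashes_explained), unexplained query hashes).
    """
    remaining = set(query)
    results = []
    while remaining:
        best_name, best_hits = None, set()
        for name in sorted(references):  # sorted for deterministic ties
            hits = references[name] & remaining
            if len(hits) > len(best_hits):
                best_name, best_hits = name, hits
        if best_name is None or len(best_hits) < min_overlap:
            break  # nothing left that any reference can explain
        results.append((best_name, len(best_hits)))
        remaining -= best_hits
    return results, remaining


# Hypothetical toy sketches: integers stand in for k-mer hashes.
kos = {"KO_A": {1, 2, 3, 4}, "KO_B": {3, 4, 5}, "KO_C": {9}}
metagenome = {1, 2, 3, 4, 5, 7}
print(greedy_gather(metagenome, kos))
```

Because each hash is credited to the first reference that claims it, k-mers shared between KOs are assigned to only one of them, which is why shared k-mer content between orthologous groups matters for this design.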

I did some work similar to this with the pfam database a couple years ago: https://github.com/taylorreiter/2021-pfam-shared-kmers

I'm curious if you did any sort of analysis to see if there is shared kmer content between orthologous groups, or if high shared content (as is observed in pfam) would limit the ability of this approach to be generalized to other databases.
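A check of this kind could be sketched as a pairwise comparison of KO sketches that flags pairs with high shared content. The code below is only an illustration of the idea, not the analysis from the linked pfam repository or from the manuscript; `max_containment`, `highly_shared_pairs`, the 0.2 threshold, and the KO labels are all assumptions.

```python
from itertools import combinations


def max_containment(a, b):
    """Shared hashes divided by the size of the smaller sketch."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def highly_shared_pairs(sketches, threshold=0.2):
    """Return (name_x, name_y, containment) for pairs at or above threshold."""
    pairs = []
    for x, y in combinations(sorted(sketches), 2):
        c = max_containment(sketches[x], sketches[y])
        if c >= threshold:
            pairs.append((x, y, c))
    return pairs


# Hypothetical toy sketches: integers stand in for k-mer hashes.
sketches = {
    "KO_A": {1, 2, 3, 4},
    "KO_B": {3, 4, 5, 6, 7, 8},
    "KO_C": {10, 11},
}
print(highly_shared_pairs(sketches))
```

For a database with tens of thousands of groups the all-pairs loop becomes expensive, so an indexed search would likely be needed in practice.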

Arcadia Science
Nov 7, 2023

tinier

shorter

Arcadia Science
Nov 7, 2023

Using the functional profiles as input, we computed the pairwise FunUniFrac distances for T2D vs. HHS and performed MDS on the resulting pairwise distance matrices for visualization

is the code for this also in the linked github repo? I couldn't find it, but I think it's an interesting application. It would be nice if something similar could be implemented for sourmash taxonomy results

Arcadia Science
Nov 7, 2023

pairwise distances between KOs obtained using sourmash sketch

pairwise distances between KOs obtained by comparing sourmash sketches, right?

Arcadia Science
Nov 7, 2023

Next, we analyze the distinct functions among different conditions (Type 2 Diabetes, T2D; Healthy, HHS; and Inflammatory Bowel Disease, IBD). We conducted a LEfSe analyses [58] to unveil the key functional units/pathways that underlie the distinctions between the condition T2D vs. HHS and IBD vs. HHS.

can you do these same analyses with a tool like HUMAnN2 or something else that is typically used to do functional profiling and compare the results? can you show that you capture more functional units than other tools, or is your method only faster? would you need additional databases beyond just KEGG to make the comparison fair, and is that possible with the approach you have outlined here?

I think the 2019 HMP IBD paper has a supplemental figure where they have KOs for each sample. it would be interesting to compare against those results for those samples to see if you get the same or different results (super set, subset, etc).

Arcadia Science
Nov 7, 2023

sourmash clearly is the better choice when high-coverage samples are available.

I think this is too strong of a statement for the results presented. What about divergence in the metagenome vs. what's in the database? while using an amino acid k-mer will overcome some of this, I would expect diamond to better capture functional potential of a metagenome when the genomes are not in reference databases (I haven't explicitly done this test though so I don't know).

Arcadia Science
Nov 7, 2023

We also found that KofamScan has exceptionally high resource requirements, and yet did not show promising performance.

again, I think this comparison is unfair since you aren't using assembled genomes

Arcadia Science
Nov 7, 2023

On the other hand, the use of lightweight sketches allows sourmash to avoid alignment altogether, and identify the list of all present KOs more accurately, using fewer computational resources.

this is not always a benefit. The "alignments" output by diamond can be super useful if the user wants to go back and do a targeted alignment of a specific gene of interest.

Arcadia Science
Nov 7, 2023

We used two different k-mer sizes when running sourmash. In these experiments, we used a single active thread to run the sourmash gather program, and 64 threads to run DIAMOND to generate these results. The computational resources (total CPU time and memory) to generate these results are shown in Figure 2 (c and d).

what about wall time? because diamond can be threaded, which is a huge plus, while sourmash gather cannot

Arcadia Science
Nov 7, 2023

From our simulation experiments, we found that KofamScan fails to scale to metagenomes with millions of reads (taking more than seven days to complete on a simulated metagenome with 1M reads) – making it an impractical choice for this task. Nevertheless, because KofamScan was developed so closely with the KEGG database, we present the comparison in this manuscript.

this doesn't make a lot of sense as an application though right? kofamscan is designed to run on ORFs predicted from assembled genomes, not on metagenome reads?

Arcadia Science
Nov 7, 2023

The pipeline is freely available and can be accessed here: https://github.com/KoslickiLab/funprofiler

I noticed that this repo doesn't have any unit tests and that the python script only contains 58 lines of code. Would it be possible to include this approach directly in sourmash?

Arcadia Science
Nov 7, 2023

the primary use of alignment-based algorithms makes these a poor practical choice in terms of scalability

even more than this, many of these algorithms are limited to the setting of assembled (meta)genomes, and there are a substantial number of studies showing that short read assembly often fails for metagenomes, especially for those from complex communities. If your method can work directly on short reads, I think that is a huge strength that is worth highlighting

(I believe DIAMOND-based approaches will also work quite well on short reads, but many of the others do not. while I have used diamond metagenomes against small databases [see the serratus rdrp paper for inspiration here], I'm not sure how well it would scale to whole metagenomes against all of e.g. KEGG).

Arcadia Science
Nov 7, 2023

KOs

Aren't they called KEGG Orthologs, which is abbreviated to KOs?

Arcadia Science
Nov 7, 2023

These more popular alignment-based tools also lack the use of orthology relationships of the genes.

This statement isn't clear to me. It seems like the KOALA and KofamScan algorithms do consider orthology. Can you expand this statement to make it clear what this means?

Arcadia Science
Nov 7, 2023

continue to turn to sketching-based methods, which are often faster and more lightweight; and theoretical guarantees of the sketching algorithms ensure their high accuracy.

can you provide citations for this point, both before and after the semicolon?

Arcadia Science
Nov 7, 2023

least common ancestor

last or lowest common ancestor?

Version published to 10.1101/2023.11.06.565843v1 on bioRxiv
Nov 6, 2023
