Fast, lightweight, and accurate metagenomic functional profiling using FracMinHash sketches

Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. In general, k-mer-based sketching techniques have been successfully used in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe leveraging FracMinHash (implemented in sourmash, a publicly available software package), a k-mer-sketching algorithm, to obtain functional profiles of metagenome samples. We show how pieces of the sourmash software (and the resulting FracMinHash sketches) can be put together in a pipeline to functionally profile a metagenomic sample.
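The core FracMinHash idea described above can be sketched in a few lines. The following is a toy illustration only (md5 stands in for the MurmurHash used by sourmash, and the helper names are hypothetical), not the sourmash implementation:

```python
import hashlib

MAX_HASH = 2 ** 64

def kmer_hash(kmer: str) -> int:
    # Deterministic 64-bit hash of a k-mer (sourmash uses MurmurHash3;
    # md5 is used here purely for illustration).
    return int.from_bytes(hashlib.md5(kmer.encode()).digest()[:8], "big")

def fracminhash_sketch(seq: str, k: int = 7, scaled: int = 1000) -> set:
    # Keep a k-mer's hash iff it lands in the bottom 1/scaled fraction
    # of the hash space, i.e. a fixed fraction of all distinct k-mers.
    threshold = MAX_HASH // scaled
    sketch = set()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            sketch.add(h)
    return sketch

def containment(query: set, ref: set) -> float:
    # Fraction of the query sketch also present in the reference sketch.
    return len(query & ref) / len(query) if query else 0.0
```

Because the threshold is a fixed fraction of the hash space, sketches of different datasets remain comparable, and a sketch built with a larger `scaled` value is always a subset of one built with a smaller `scaled`.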


We report that the functional profiles obtained using this pipeline demonstrate superior completeness and purity compared to the profiles obtained using other alignment-based methods when applied to simulated metagenomic data. At the same time, we also report that our functional profiling pipeline is 42-51x faster in CPU time, 10-15.8x faster in running time, and consumes up to 20% less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets.


This fast and lightweight metagenomic functional profiler is freely available and can be accessed here: . All scripts for the analyses presented in this manuscript can be found on GitHub.

Article activity feed

  1. After all these filtering steps, we have 1747 high-quality samples remaining for the downstream analysis, including 547 healthy samples, 274 type 2 diabetes samples, and 926 samples related to inflammatory bowel disease.

    Many of these sequences contain detectable human sequences. I would be curious for you to run the human genome against your databases and see what functional profile is returned; that would let users know whether they need to do host filtering before applying this approach. If they didn't need to take that step (or any other QC), that would be a huge time savings.
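The host-filtering question raised here could be prototyped cheaply with a k-mer containment screen against a host reference set. A toy sketch (the function names and the 0.5 threshold are invented for illustration, not part of the authors' pipeline):

```python
def kmers(seq: str, k: int = 21) -> set:
    # All k-length substrings of a sequence.
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def is_host_read(read: str, host_kmers: set, k: int = 21,
                 threshold: float = 0.5) -> bool:
    # Flag a read as host-derived if at least `threshold` of its k-mers
    # appear in the host reference k-mer set.
    rk = kmers(read, k)
    return bool(rk) and len(rk & host_kmers) / len(rk) >= threshold
```

Running the host genome through the profiler, as suggested, would reveal whether such a pre-filter is necessary or whether the KO database simply returns no confident hits for host reads.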

  2. -p protein,k=7,k=11,k=15,abund,scaled=1000

    How did you come up with these parameters, and how do you know they are the best to use? How would you advise users to choose between k-mer sizes for their own applications?
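One empirical way to reason about k-mer size choice is to measure how quickly k-mer overlap decays with sequence divergence. A toy illustration (the sequence and the one-substitution-per-ten-residues mutation pattern are invented for demonstration):

```python
def kmers(seq: str, k: int) -> set:
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b) if a | b else 0.0

orig = "MSTNPKPQRKTKRNTNRRPQDVKFPGGGQIVGGVYLLPRRGPRLGVRATRKTSERSQPR"
# Substitute one residue every 10 positions: a crude stand-in for
# roughly 10% divergence from the reference.
mut = "".join("A" if i % 10 == 5 else c for i, c in enumerate(orig))

for k in (7, 11, 15):
    print(k, round(jaccard(kmers(orig, k), kmers(mut, k)), 3))
```

With a substitution every 10 residues, every window of length 11 or more contains a mismatch, so longer k-mers lose nearly all overlap while k=7 retains some: shorter k tolerates divergence, longer k buys specificity. This is the trade-off users would weigh when picking among k=7, 11, and 15.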

  3. We used BBMap [9] to simulate a metagenome from 1000 randomly selected genomes from all 4498 bacterial genomes present in the KEGG database.

    Reiterating the point from above: how does your approach break down with increasing evolutionary divergence from the reference, and how does that differ from other tools? Soil might be a good ecosystem to test-drive this in, and I think the CAMISIM tool allows you to introduce mutations from a reference at a known rate/identity, etc.

  4. The number of KOs (a total of only 25K) is much smaller than the number of genes, and the number of k-mers in a KO is much larger than that of a single gene. Considering these factors, we designed our pipeline to invoke sourmash gather with a list of all KOs in the KEGG database, and then to output a list of KOs that ‘cover’ all observed k-mers in a given metagenome.

    I did some work similar to this with the pfam database a couple years ago:

    I'm curious whether you did any analysis to see if there is shared k-mer content between orthologous groups, or if high shared content (as is observed in Pfam) would limit the ability of this approach to generalize to other databases.
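At its core, the ‘cover’ step quoted above is a greedy minimum set cover over sketch hashes. A toy re-implementation of that greedy loop (the KO IDs and k-mer sets below are invented; this is not the sourmash gather code):

```python
def greedy_gather(sample_kmers: set, ko_db: dict) -> list:
    # Greedy minimum set cover: repeatedly pick the KO whose k-mer set
    # explains the most still-uncovered sample k-mers, until nothing
    # remaining can be explained.
    remaining = set(sample_kmers)
    picked = []
    while remaining:
        best = max(ko_db, key=lambda ko: len(ko_db[ko] & remaining))
        gain = len(ko_db[best] & remaining)
        if gain == 0:
            break  # no KO covers any remaining k-mer
        picked.append((best, gain))
        remaining -= ko_db[best]
    return picked

ko_db = {"K00001": {1, 2, 3, 4}, "K00002": {3, 4, 5}, "K00003": {9}}
result = greedy_gather({1, 2, 3, 4, 5, 7}, ko_db)
```

This framing also makes the reviewer's concern concrete: if two KOs share substantial k-mer content, the greedy step attributes the shared k-mers to whichever KO is picked first, which is where high inter-group overlap could distort the profile.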

  5. Using the functional profiles as input, we computed the pairwise FunUniFrac distances for T2D vs. HHS and performed MDS on the resulting pairwise distance matrices for visualization

    Is the code for this also in the linked GitHub repo? I couldn't find it, but I think it's an interesting application. It would be nice if something similar could be implemented for sourmash taxonomy results.

  6. Next, we analyze the distinct functions among different conditions (Type 2 Diabetes, T2D; Healthy, HHS; and Inflammatory Bowel Disease, IBD). We conducted a LEfSe analysis [58] to unveil the key functional units/pathways that underlie the distinctions between the conditions T2D vs. HHS and IBD vs. HHS.

    Can you do these same analyses with a tool like HUMAnN2, or something else that is typically used for functional profiling, and compare the results? Can you show that you capture more functional units than other tools, or is your method only faster? Would you need additional databases beyond just KEGG to make the comparison fair, and is that possible with the approach you have outlined here?

    I think the 2019 HMP IBD paper has a supplemental figure with KOs for each sample. It would be interesting to compare against those results for those samples to see whether you get the same or different results (superset, subset, etc.).

  7. sourmash clearly is the better choice when high-coverage samples are available.

    I think this is too strong a statement for the results presented. What about divergence between the metagenome and what's in the database? While using an amino acid k-mer will overcome some of this, I would expect DIAMOND to better capture the functional potential of a metagenome when the genomes are not in reference databases (I haven't explicitly done this test, though, so I don't know).

  8. We also found that KofamScan has exceptionally high resource requirements, and yet did not show promising performance.

    Again, I think this comparison is unfair, since you aren't using assembled genomes.

  9. On the other hand, the use of lightweight sketches allows sourmash to avoid alignment altogether, and identify the list of all present KOs more accurately, using fewer computational resources.

    This is not always a benefit. The alignments output by DIAMOND can be super useful if the user wants to go back and do a targeted alignment of a specific gene of interest.

  10. We used two different k-mer sizes when running sourmash. In these experiments, we used a single active thread to run the sourmash gather program, and 64 threads to run DIAMOND to generate these results. The computational resources (total CPU time and memory) to generate these results are shown in Figure 2 (c and d).

    What about wall time? DIAMOND can be threaded, which is a huge plus, while sourmash gather cannot.
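The wall-time vs CPU-time distinction raised here is easy to report alongside any benchmark. A minimal measurement pattern in Python (the `timed` wrapper is a hypothetical helper, not from the paper's scripts):

```python
import time

def timed(fn, *args, **kwargs):
    # perf_counter measures wall-clock time (what a user actually waits);
    # process_time measures CPU time summed over this process's threads.
    wall0, cpu0 = time.perf_counter(), time.process_time()
    result = fn(*args, **kwargs)
    wall = time.perf_counter() - wall0
    cpu = time.process_time() - cpu0
    return result, wall, cpu

# Sleeping accrues wall time but almost no CPU time, so the two can
# diverge sharply; for a multithreaded tool like DIAMOND, CPU time can
# instead exceed wall time by roughly the thread count.
_, wall, cpu = timed(time.sleep, 0.2)
```

Reporting both numbers (as the comment requests) makes single-threaded vs multithreaded tools directly comparable.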

  11. From our simulation experiments, we found that KofamScan fails to scale to metagenomes with millions of reads (taking more than seven days to complete on a simulated metagenome with 1M reads) – making it an impractical choice for this task. Nevertheless, because KofamScan was developed so closely with the KEGG database, we present the comparison in this manuscript.

    This doesn't make a lot of sense as an application, though, right? KofamScan is designed to run on ORFs predicted from assembled genomes, not on metagenome reads.

  12. The primary use of alignment-based algorithms makes these a poor practical choice in terms of scalability

    Even more than this, many of these algorithms are limited to the setting of assembled (meta)genomes, and there is a substantial number of studies showing that short-read assembly often fails for metagenomes, especially those from complex communities. If your method can work directly on short reads, I think that is a huge strength worth highlighting.

    (I believe DIAMOND-based approaches will also work quite well on short reads, but many of the others do not. While I have used DIAMOND to search metagenomes against small databases [see the Serratus RdRp paper for inspiration here], I'm not sure how well it would scale to whole metagenomes against all of, e.g., KEGG.)

  13. These more popular alignment-based tools also lack the use of orthology relationships of the genes.

    This statement isn't clear to me. It seems like the KOALA and KofamScan algorithms do consider orthology; can you expand this statement to make it clear what this means?

  14. continue to turn to sketching-based methods, which are often faster and more lightweight; and theoretical guarantees of the sketching algorithms ensure their high accuracy.

    Can you provide citations for this point, both before and after the semicolon?