Meta-Prism 2.0: Enabling algorithm for ultra-fast, accurate and memory-efficient search among millions of microbial community samples

This article has been Reviewed by the following groups

Read the full article See related articles

Abstract

Motivation

Microbial community samples and sequencing data have been accumulating faster than ever, with tens of thousands of samples being sequenced each year. Mining such a huge amount of multi-source, heterogeneous data is becoming increasingly difficult. Among the bottlenecks in sample mining, efficient and accurate search of samples is one of the most prominent: faced with millions of samples in a data repository, traditional sample comparison and search approaches fall short in both speed and accuracy.

Results

Here we present Meta-Prism 2.0, a microbial community sample search method based on smart pair-wise sample comparison, which pushes time and memory efficiency to a new limit without compromising accuracy. Built on a memory-saving data structure, a time-saving instruction pipeline, and boost-scheme optimization, Meta-Prism 2.0 enables ultra-fast, accurate, and memory-efficient search among millions of samples. Meta-Prism 2.0 has been put to the test on several datasets, the largest containing one million samples. The results show that, firstly, as a distance-based method, Meta-Prism 2.0 is faster not only than other distance-based methods but also than unsupervised methods. Its search speed of 0.00001 s per sample pair, together with its 8 GB memory requirement for searching against one million samples, makes it the most efficient method for sample comparison. Secondly, Meta-Prism 2.0 achieves comparison accuracy and search precision comparable to or better than those of other contemporary methods. Thirdly, Meta-Prism 2.0 can precisely identify the original biome for samples, thus enabling sample source tracking.

Conclusion

In summary, Meta-Prism 2.0 can perform accurate searches among millions of samples at very low memory cost and high speed, enabling knowledge discovery from samples at a massive scale. It changes the traditionally resource-intensive sample comparison and search scheme into a cheap and effective procedure that researchers can run every day, even on a laptop, for insightful sample search and knowledge discovery. Meta-Prism 2.0 can be accessed at: https://github.com/HUST-NingKang-Lab/Meta-Prism-2.0 .

Article activity feed

  1. Abstract

    This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giac073), and the reviews have been published under the same license.

    **Reviewer 1. Siyuan Ma**

    Reviewer Comments to Author: In Kang, Chong, and Ning, the authors present Meta-Prism 2, a microbial community analysis framework, which calculates sample-sample dissimilarities and queries microbial profiles similar to those of user-provided targets. Meta-Prism 2 adopts efficient algorithms to achieve the time and memory efficiency required for modern microbiome "big data" application scenarios. The authors evaluated Meta-Prism 2's performance, both in terms of separating different biomes' microbial profiles and time/memory usage, on a variety of real-world studies. I find the application target of Meta-Prism appealing: achieving efficient dissimilarity profiling is increasingly relevant for modern microbiome applications. However, I'm afraid the manuscript appears to be in a poor state, with insufficient details for crucial methods and results components. Some display items are either missing or mis-referenced. As such, I cannot recommend its acceptance unless major improvements are made. My comments are detailed below.

    Major

    1. The authors claim that from its previous iteration, the biggest improvements are: (1) removal of redundant nodes in 1-against-N sample comparisons. (2) functionality for similarity matrix calculation (3) exhaustive search among all available samples.

    a. (1) seems the most crucial for the method's improved efficiency. However, the details on why these nodes can be eliminated, and how dissimilarity calculation is achieved post-elimination, are not sufficient. The caption for Figure 1C and the relevant Methods text (lines 173-188) should be expanded, to at least explain i) why it is valid to calculate (dis)similarity post-elimination based on aggregation, and ii) how aggregation is achieved for the target samples.

    b. I may not have understood the authors on (2), but this improvement seems trivial? Is it simply that Meta-Prism 2 has a new function to calculate all pair-wise dissimilarities on a collection of microbial profiles?

    c. For (3), it should be made clearer that Meta-Prism 1 does not do this. I needed to read the authors' previous paper to understand the comment about better flexibility in customized datasets. I assume that this improvement is enabled because Meta-Prism 2 is vastly faster compared to 1? If so, it might be helpful to point this out explicitly.

    2. I am lost on the accuracy evaluation results for predicting different biomes (Figure 2).

    a. How are biomes predicted for each microbial sample?

    b. What is the varying classification threshold that generates different sensitivities and specificities?

    c. Does "cross-validation" refer to, e.g., selection of tuning parameters during model training, or to evaluation of model performance?

    d. What are the "Fecal", "Human", and "Combined" biomes for the FEAST cohort? Such details were not provided in Shenhav et al.
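    The validity of point i) above can be illustrated with a minimal, hedged sketch (not Meta-Prism 2.0's actual code; all names here are hypothetical). In a weighted-UniFrac-style measure, a tree branch through which neither sample routes any abundance contributes zero to both the numerator and the denominator, so iterating only over the union of each sample's non-zero branches yields the same value as a full traversal:

    ```cpp
    #include <cmath>
    #include <unordered_map>
    #include <vector>

    // Hypothetical sketch: each tree branch i has a length len[i]; a sample is
    // a sparse map from branch index to the abundance mass flowing through it.
    using SparseSample = std::unordered_map<int, double>;

    // Weighted-UniFrac-style distance computed only over branches present in
    // at least one sample; branches absent from BOTH samples are pruned, since
    // they would add 0 to both numerator and denominator.
    double prunedDistance(const std::vector<double>& len,
                          const SparseSample& a, const SparseSample& b) {
        double num = 0.0, den = 0.0;
        auto accumulate = [&](int i, double va, double vb) {
            num += len[i] * std::fabs(va - vb);
            den += len[i] * (va + vb);
        };
        // Branches touched by sample a (with or without a match in b).
        for (const auto& kv : a) {
            auto it = b.find(kv.first);
            accumulate(kv.first, kv.second, it == b.end() ? 0.0 : it->second);
        }
        // Branches touched only by sample b.
        for (const auto& kv : b) {
            if (a.find(kv.first) == a.end()) accumulate(kv.first, 0.0, kv.second);
        }
        return den > 0.0 ? num / den : 0.0;
    }
    ```

    Identical samples give a distance of 0 and fully disjoint samples give 1, regardless of how many pruned branches the reference tree contains, which is why elimination does not change the result.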

    Moderate

    1. I understand that this was previously published, but could the authors comment on the intuitions behind their dissimilarity measure, and how it compares to similar measures such as the weighted UniFrac? a. Does Meta-Storm and Meta-Prism share the same similarity definition? If so, why would they differ in terms of prediction accuracies?
    2. There seems to be some mis-referencing on the panels of Figure 1. a. Panel B was not explained at all in the figure caption. b. Line 185 references Figure 1E, which does not exist.

    Minor

    1. The Meta-Prism 1 publication was referenced with duplicates (#16 and 24)
    2. There are minor language issues throughout the manuscript, but they do not affect understanding of the materials. Examples: a. Line 94: analysis -> analyze b. Line 193: We also obtained a dataset that consists of ...

    Re-review:

    I find most of my questions addressed. My only remaining issue is still that the three biomes from FEAST (Fecal, Human, and Mixed) are still not clearly defined. The only definition I could find is line 206-208 "We also obtained a dataset that consists of 10,270 samples belonging to three biomes: Fecal, Human, and Mixed, which have been used in the FEAST study, defined as the FEAST dataset". Are "Fecal" simply stool samples, and "Human" samples biopsies from the human gut? What is "Mixed"? As a main utility of Meta-Prism is source tracking, it is important for the reader to understand what these biomes are, to understand the resolution of the source tracking results. If this can be resolved, I'll be happy to recommend the manuscript's acceptance.

    **Reviewer 2. Yoann Dufresne**

    In this article the authors present Meta-Prism 2, a software tool to compute distances between metagenomic samples and to query a specific sample against a pool of samples. They call a "sample" a precomputed file with the abundances of multiple taxa. In the article they first succinctly present multiple aspects of the underlying algorithms. Then they provide an extensive analysis of the precision, RAM, and time consumption of the software. Finally, they show 3 applications of Meta-Prism 2.

    I will start by saying that the execution time of the tool looks very good compared to all other tools. But I have multiple concerns about these numbers.

    • First, I like to reproduce the results of a paper before approving it. But I had a few problems doing so.
    • The tool does not compile as-is from the git repository. I had to modify a line of code to compile it. This is not a serious issue, but authors of tools should make sure that their main code branch always compiles. See the end of the review for the bug and fix.
    • The analyses are done using samples from MGnify. I found the related OTU tsv files linked in the supplementary material, but no explanation of how to transform such files into the pdata files that the software processes.
    • The only way to directly reproduce the results is to trust the pdata files present on the authors' GitHub. I would like to run my own experiments and compare the time to transform OTU files into pdata with the actual run time of MP2.
    • The authors evaluated the accuracy of their method (which is nice) but did not give access to the scripts that were used for that. I would like to see the code and try to reproduce the figure myself on my own data.
    • The 2nd and 3rd applications are explained in plain text, but there is no related script, nor any table or graphics, to reproduce or explain the results. The only way for me to evaluate this part is to trust the word of the authors. I would like the authors to show me clear and indisputable evidence.

    For the methods part it is similar. We have hints about what the authors did, but not a full explanation:

    • For the similarity function, I would like to know where it comes from. The cited papers [14] and [24] do not help with the comprehension of the formula. If the function is from another paper, I ask the authors to add a clear reference (paper + section in the paper); if not, I would like the authors to explain in detail why this particular function was chosen, how they constructed it, and how it behaves.
    • The authors refer multiple times to a "sparse format" applied to disk & cache but never define what they mean by that. I would like to see in this section which exact data structure is used.
    • In the fast 1-N sample comparison, the authors write about "current methods" without citing them. I would like the authors to refer to precise methods/software, succinctly describe them, and then compare their method against them. Also, in this part, the authors point at Figure 1E, which is not present in the manuscript.
    • Figure 1 is not fully understandable without further details in the text. For example, what is Figure 1C4?
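    As an illustration of what such a "sparse format" could look like (a hypothetical sketch, not Meta-Prism 2.0's documented on-disk layout; all names are illustrative): instead of storing a dense abundance value for every node of the reference taxonomy, only (node index, abundance) pairs for the non-zero nodes are kept, which shrinks both disk and cache footprints when each sample covers a small fraction of the tree:

    ```cpp
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // Hypothetical sparse layout: one (node, value) pair per non-zero
    // taxonomy node, instead of one value per node of the whole tree.
    struct SparseEntry {
        uint32_t node;     // index of the taxonomy node
        float abundance;   // relative abundance at that node
    };

    // Convert a dense per-node abundance vector into the sparse form,
    // keeping only the non-zero entries.
    std::vector<SparseEntry> toSparse(const std::vector<float>& dense) {
        std::vector<SparseEntry> out;
        for (uint32_t i = 0; i < dense.size(); ++i)
            if (dense[i] != 0.0f) out.push_back({i, dense[i]});
        return out;
    }

    // Approximate storage cost of each representation, ignoring headers.
    size_t denseBytes(size_t totalNodes) { return totalNodes * sizeof(float); }
    size_t sparseBytes(size_t nonZeros)  { return nonZeros * sizeof(SparseEntry); }
    ```

    With a reference tree of, say, 100,000 nodes and a few hundred non-zero nodes per sample, the sparse form would be orders of magnitude smaller, which is the kind of saving that could let a million-sample search fit in a few gigabytes of RAM.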

    I want to point out that the paper is not correctly balanced in terms of content: 1.5 pages of execution-time analysis is too much compared to the 2 pages of methods and less than 1 page of real-data applications.

    Finally, the authors are presenting a software tool but are not following development standards. They should provide unit and functional tests for their software. I also strongly recommend that they set up continuous integration for the git repository. With such a tool, the compilation problem would not exist.

    To conclude, I think that the authors engineered the software very well but did not present it the right way. I suggest the authors rewrite the paper with strong improvements to the "Methods" and "Real data application" sections. Also, to provide long-term useful software, they have to add guarantees to the code, such as tests and CI.

    For all these reasons, I recommend to reject this paper.

    --- Bug & Fix ---

    ```
    make
    mkdir -p build
    g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/loader.o src/loader.cpp
    g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/newickParser.o src/newickParser.cpp
    g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/simCalc.o src/simCalc.cpp
    g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/structure.o src/structure.cpp
    g++ -std=c++14 -O3 -m64 -march=native -pthread -c -o build/main.o src/main.cpp
    src/main.cpp: In function 'int main(int, const char**)':
    src/main.cpp:128:31: error: 'class std::ios_base' has no member named 'clear'
      128 |     buf.ios_base::clear();
          |                   ^~~~~
    make: *** [makefile:7: build/main.o] Error 1
    ```

    To fix the bug: src/main.cpp:128 => buf.ios.clear();
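    The compiler error is consistent with the standard library: `std::ios_base` declares no `clear()` member; `clear()` is defined on `std::basic_ios`, so an unqualified call on the stream object resolves correctly while the `ios_base`-qualified one is rejected. A minimal sketch of the failing pattern and a working alternative (the variable name `buf` mirrors the snippet above; the surrounding code is hypothetical):

    ```cpp
    #include <sstream>

    // Reset a stream's error state. The commented-out line reproduces the
    // compile error from the build log above.
    void resetStream(std::stringstream& buf) {
        // buf.ios_base::clear();  // does not compile: 'std::ios_base' has
        //                         // no member named 'clear'
        buf.clear();               // OK: resolves to std::basic_ios<char>::clear(),
                                   // which resets the stream state to goodbit
    }
    ```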