MBGC: Multiple Bacteria Genome Compressor


Abstract

Background

Genomes within the same species show high similarity, which is exploited by specialized multiple-genome compressors. The existing algorithms and tools are, however, targeted at large (e.g., mammalian) genomes, and their performance on bacterial strains is rather moderate.

Results

In this work, we propose MBGC, a specialized genome compressor making use of the specific redundancy of bacterial genomes. Its characteristic features are finding both direct and reverse-complemented LZ-matches, as well as careful management of a reference buffer in a multi-threaded implementation. Our tool is not only compression efficient but also fast. On a collection of 168,311 bacterial genomes, totalling 587 GB, we achieve a compression ratio of approximately 1,265 and compression (respectively, decompression) speed of ∼1,580 MB/s (respectively, 780 MB/s) using 8 hardware threads, on a computer with a 14-core/28-thread CPU and a fast SSD, being almost 3 times more succinct and more than 6 times faster in compression than the next best competitor.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer 3: Idoia Ochoa

The authors present a novel tool for the compression of collections of bacterial genomes. The authors present sound results that demonstrate the performance gain of their tool, MBGC, with respect to the state of the art. As such, I do not have concerns about the method itself. My main concerns are with respect to the description of the tool and how the results are presented. Next, I list some of my suggestions (in no particular order):

Main Paper:

• Analysis section: Before naming MBGC, specify that it is the proposed tool.

• Analysis section: Add a reference for HRCM. Mention here also that other tools such as iDoComp, GDC2, etc. are discussed in the Supplementary (this way the reader knows more tools were analyzed, or at least tried on the data).

    • Analysis section: The paragraph "Our experiments with MBGC show that... " is a little misleading, since it seems that the tool has the capacity to compress a collection and just extract a single genome from it. This becomes clear later in the text when it is discussed how the tool could be used to speed up the download of a collection of genomes from a repository. So maybe explain that in more detail here, or mention that it could be used to compress a bunch of genomes prior to download. And then point to the part of the text where this is discussed in more detail.

• Analysis section: The results talk about the "stronger MBGC mode", the "MBGC max", but in the tables it reads "MBGC default" or "MBGC -c 3". I assume "MBGC -c 3" refers to "MBGC max", but it is not stated anywhere. It may be better to call them "MBGC default" and "MBGC max".

• Analysis section: Although the method is explained later in the text, it would be a good idea to give a sense of the difference between the default and max modes of the tool, or some hints on the trade-off between the two. Also, the parameter "-c 3" is never explained.

• Analysis section: Figures: it is difficult to see the trade-off between relative size and relative time; can you use colored lines, such that the same color refers to the same set of genomes? Also, in the caption, explain whether a small or a high relative size and time is desirable. It may be clear, but it is better to state it explicitly.

• Analysis section: There is a sentence that says "all figures w.r.t. the default mode of MBGC". It would be good also to state that in the caption, so that the reader knows which mode of the tool is being used to generate the presented results, and whether the input files are gzipped or not. For example, for the following paragraph that starts with Fig. 1, it is not clear if the files are gzipped or not.

• Analysis section: The first time GDC2 is mentioned, the first thing that comes to mind is why it was not used for the bacterial experiments. See my previous point about having a couple of sentences on the other tools that were considered and why they are not included in the main tables/figures.

    • Methods:

-- Here I am really missing a diagram explaining the main steps of the tool. It seems the paper has been rewritten slightly to fit the format of the journal and some things are not in the correct order. For example, it says the key ideas are already sketched, but I do not think that is true.

-- (offset, length): I assume this refers to the position in the REF where the match begins and the length of the match, but again, this is not really explained. A diagram would help. Also, when it is time to compress the pairs, are the offsets delta-encoded, or encoded as they are with a general compressor? (A small illustrative sketch is given after this list.)

    -- How are the produced tokens (offset, length, literals, etc.) finally encoded?

-- The first time the parameter "k" is mentioned, what is its default value? Also, how can you do a left extension and "swallow" the previous match? Is it because the previous match could have been at another position? Otherwise, if it were in that position, it would already have been extended to the right, correct? I mean, it would have generated a longer match.

    -- The "skip margin" idea is not well explained. not sure why the next position after a match is decreased by m. please explain better or use a diagram with an example.

-- When you mention 1/192, consider stating already there that this is controlled by the parameter u; otherwise, when the different parameters are later listed, it is difficult to relate them to the explanation of the algorithm.
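
To make the (offset, length) and sampling questions above more concrete, below is a minimal, purely illustrative sketch of a reference-based LZ parser that indexes sampled k-mers of the reference and emits match/literal tokens with delta-coded offsets. The seed length, the sampling step, the token layout and the delta coding are assumptions made for this example only; they are not taken from MBGC and do not answer how MBGC actually encodes its tokens.

```cpp
// Hypothetical sketch of reference-based LZ parsing with sampled k-mer
// anchors and delta-coded match offsets. All names and constants are
// illustrative; they do not describe MBGC's actual data structures,
// defaults or encoding.
#include <cstdint>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

struct Token {
    bool isMatch;
    uint64_t offset;       // match start in the reference (REF)
    uint64_t length;       // match length
    std::string literals;  // raw symbols when no match was found
};

int main() {
    const std::string ref    = "ACGTACGTTTACGTACGGGACGTACGT";
    const std::string target = "ACGTACGTTTACGTACGTCCACGTACG";
    const size_t k = 8;    // seed length (placeholder, not MBGC's default)
    const size_t step = 4; // index every 4th reference position
                           // (the paper samples roughly 1/192, parameter u)

    // Hash sampled k-mers of the reference.
    std::unordered_map<std::string, uint64_t> anchors;
    for (size_t i = 0; i + k <= ref.size(); i += step)
        anchors.emplace(ref.substr(i, k), i);

    // Greedy parse of the target into matches and literal runs.
    std::vector<Token> tokens;
    std::string pending;
    size_t i = 0;
    while (i + k <= target.size()) {
        auto it = anchors.find(target.substr(i, k));
        if (it == anchors.end()) { pending += target[i++]; continue; }
        uint64_t off = it->second, len = k;
        while (off + len < ref.size() && i + len < target.size() &&
               ref[off + len] == target[i + len]) ++len;   // right extension
        if (!pending.empty()) { tokens.push_back({false, 0, 0, pending}); pending.clear(); }
        tokens.push_back({true, off, len, ""});
        i += len;
    }
    pending += target.substr(i);                            // trailing literals
    if (!pending.empty()) tokens.push_back({false, 0, 0, pending});

    // One possible serialization: offsets stored as deltas w.r.t. the
    // previous match (whether MBGC does this is exactly the question above).
    int64_t prev = 0;
    for (const Token& t : tokens) {
        if (t.isMatch) {
            std::cout << "match   delta=" << static_cast<int64_t>(t.offset) - prev
                      << " len=" << t.length << "\n";
            prev = static_cast<int64_t>(t.offset);
        } else {
            std::cout << "literal \"" << t.literals << "\"\n";
        }
    }
    return 0;
}
```

On the toy sequences above, the sketch prints one long match followed by a short literal run.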

    Availability of supp...

-- "from from" (typo)

Tables

    -- Specify the number of genomes in each collection.

-- Change "MBGC -c 3" to "MBGC max" or something similar (see my previous comment; the -c flag is not explained!).

    Supplementary Material

-- Move Table 1 after the text for ease of reading.

-- It is not clear whether the tool has random access or not. The supplement discusses the percentage of time (w.r.t. decompressing the whole collection, I believe) that it would take to decompress one of the first genomes vs. one of the last ones. This should be better explained. For example, if we decompress the last genome of the collection we will need 100% of the time, right, given that the previous genomes are (potentially) part of the REF? Please explain this better and discuss the point in the analysis part, not only in the supplementary; it seems like an important aspect of the algorithm.

-- I assume this is not possible, but it should be discussed as well: can you add a genome to an already compressed collection? This, together with the random-access capabilities, would better highlight the main possible uses of the tool.

-- Section 4.3: here HT is used, and then HT is introduced only in the next paragraph. Please revise the whole text and make sure everything is in the right order.

-- Parameter m: please explain it better.

-- Add colors to the figures; it will make them easier to read.

Overall, as I mentioned before, I believe the tool offers significant improvements with respect to the competitors for bacterial genomes, and performs well on non-bacterial genomes as well. What should be improved for publication is the description of the method, since at the end of the day it is the main contribution, along with how the text is presented.

  2. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Jinyan Li

This paper proposes a compression algorithm to compress sets of bacterial genome sequences. The motivation is that the existing algorithms and tools are targeted at large (e.g., mammalian) genomes, and their performance on bacterial strains is unknown. The key idea of the proposed method is to detect characteristic features from both the direct and reverse-complemented copies of the reference genome via LZ-matching. The compression ratio is high and the compression speed is fast. Specifically, on a collection of 168,311 bacterial genomes (587 GB in file size), the algorithm achieved a compression ratio around a factor of 1,260. The author claims that the performance is much better than that of the existing algorithms. Overall, the quality of the paper is quite good.
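
(As a purely illustrative aside on the terminology used in this summary: a "reverse-complemented LZ-match" is a match found against the reverse complement of reference material rather than against the reference strand itself. The small sketch below only shows the standard reverse-complement operation; it is not code from MBGC.)

```cpp
// Minimal reverse-complement helper; shown only to illustrate what a
// "reverse-complemented LZ-match" is matched against. Not MBGC code.
#include <algorithm>
#include <iostream>
#include <string>

static char complement(char b) {
    switch (b) {
        case 'A': return 'T';   case 'T': return 'A';
        case 'C': return 'G';   case 'G': return 'C';
        default:  return 'N';   // other symbols collapsed for brevity
    }
}

static std::string reverseComplement(std::string s) {
    std::reverse(s.begin(), s.end());
    std::transform(s.begin(), s.end(), s.begin(), complement);
    return s;
}

int main() {
    // A segment absent from the reference strand can still be encoded as a
    // match if its reverse complement occurs there.
    std::cout << reverseComplement("AAACGT") << "\n";   // prints ACGTTT
}
```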

    I have two suggestions for the author to improve the manuscript:

1/ This sentence is not clear to me: "we focus on the compression of bacterial genomes, for which existing genome collection compressors are not appropriate from algorithmic or technical reasons." More clarification is needed.

2/ In my own experience, GDC2 performs better on virus genome collections than HRCM. I strongly suggest that the author add the performance of GDC2 on the bacterial genome collections.

  3. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab099), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Diogo Pratas

    This article presents a new compressor that uses both direct and reverse-complemented LZ-matches with multi-threaded and cache optimizations.

    Generally, the reported results of this tool are exciting, and once confirmed, they have good applicability in the bioinformatics community.

However, I could not reproduce the results due to a lack of instructions, the benchmark is not representative of the state of the art, and there are also several associated questions. The comments are specified below.

    Regarding the experiments:

1. The experiments could not be reproduced. Unfortunately, the instructions and documentation are not clear (see my attempts below).

2. The benchmarking is missing several well-known tools (for example, NAF, GeCo3, DELIMINATE, MFCompress, Leon, ...). See, for example:

    Kryukov, Kirill, et al. "Nucleotide Archival Format (NAF) enables efficient lossless reference-free compression of DNA sequences." Bioinformatics 35.19 (2019): 3826-3828.

    Silva, Milton, et al. "Efficient DNA sequence compression with neural networks." GigaScience 9.11 (2020): giaa119.

    Yao, Haichang, et al. "Parallel compression for large collections of genomes." Concurrency and Computation: Practice and Experience (2021): e6339.

    To access more compressors, please see the following benchmark (that is already cited in the article): https://academic.oup.com/gigascience/article/9/7/giaa072/5867695

    Regarding the manuscript:

1. The description of the state of the art in genomic data compression (or at least in the compression of genome collections) is brief and does not offer a consistent and diverse description of the already developed tools.

    2. "By the compression ratio we mean the ratio between the original input size and the compressed size. If, for example, the ratio improves from 1000 to 1500, e.g., due to changing some parameters of the compressor, we can say that the compression ratio improves 1.5 times (or by 50%)." This sentence seems a little confusing (at least for me). Please, rephrase.

    3. "The performance of the specialized genome compressor, HRCM [7], is only mediocre, and we refrained from running it on the whole collection, as the compression would take about a week.". The purpose of a data compressor can be very different: to use in machines with lower RAM, for compression-based analysis, for long-term storage, research purposes, among others. The qualification of HRCM without putting it into context seems to be depreciative.

    Regarding the tool and documentation:

1. Although I downloaded and compiled the tool, I had to spend some minutes on a "libdeflate" default-version issue. The majority of the bioinformatics community uses conda. In order to minimize installation issues for users, please provide a conda installation for the proposed tool. Also, libdeflate can already be retrieved with conda. Together with the installation instructions for mbgc, please add this line to the mbgc repository:

    conda install -c bioconda libdeflate

    Notice that this "conda" part is a suggestion that will facilitate the usage of mbgc by the bioinformatics community.

2. Running ./mbgc gives the output:

    ./mbgc: For compression expected 2 arguments after options (found 0)

    try './mbgc -?' for more information

If the help menu appeared by default (when no arguments besides the program's name are given), it would be much more helpful.
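
A minimal, hypothetical illustration of the requested behavior (not taken from the mbgc sources): print the help text whenever the program is started without any arguments, instead of reporting an error.

```cpp
// Hypothetical argument handling: show the help menu when the program is
// started without any arguments. The usage text is a placeholder.
#include <cstdio>

static void printHelp(const char* prog) {
    std::printf("Usage: %s [options] <input> <archive>\n", prog);
    std::printf("Try '%s -?' for the full option list.\n", prog);
}

int main(int argc, char** argv) {
    if (argc == 1) {          // no arguments besides the program's name
        printHelp(argv[0]);
        return 0;
    }
    // ... regular option parsing and compression/decompression would follow
    return 0;
}
```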

3. The program should have a version flag reporting the version of the program (besides the version shown in the menu). This feature is essential for integrations (e.g., conda) and for differentiating from eventual new versions of the mbgc software.

4. Please provide a running example in the help menu (using tiny sequences available in the repository).

5. Is this characteristic of mbgc a strict property: "decompresses DNA streams 80 bases per line"? This characteristic may create differences between the original files and the decompressed files. Perhaps having the possibility of a custom line size would be a valuable feature, at least for data compression scientists to assess and compare with other compressors, mainly because it makes the decompressor not completely lossless (although, in practice, only minimal information is required to maintain the whole lossless property). Nevertheless, if the program decompresses FASTA data with a fixed line size (for DNA bases) of 80 bases, this should also be mentioned in the article (besides what already exists in the repository).
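
To illustrate the point above: rewrapping the decompressed bases at a fixed 80 characters per line preserves the sequence content but not necessarily the original line breaks. The short sketch below (assumed code, not from MBGC) shows such fixed-width rewrapping.

```cpp
// Fixed-width rewrapping of a DNA stream, to illustrate why a decompressor
// that always emits 80 bases per line may not byte-match an input whose
// lines were wrapped differently (the bases themselves are unchanged).
#include <iostream>
#include <string>

int main() {
    const std::string bases(190, 'A');   // stand-in for a decompressed sequence
    const std::size_t width = 80;        // fixed line width used on output
    for (std::size_t i = 0; i < bases.size(); i += width)
        std::cout << bases.substr(i, width) << "\n";   // lines of 80, 80, 30 characters
}
```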

6. My first impression was that "sequencesListFile" contained the IDs of the bacterial genomes; then I found out that they are the URL suffixes for the FASTA repository. Then I started to wonder whether mbgc could directly accept a FASTA file containing the collection of genomes. How can the user provide the FASTA file directly? This feature would greatly simplify the usage of mbgc. Rationale: most reconstruction pipelines output multi-FASTA sequences in a single file; therefore, this feature has direct applicability. Please add more information about this in the help print and in the README. A higher goal would be to support stdin and stdout for compression/decompression as an option, and to follow the POSIX style for arguments (Program Argument Syntax Conventions). These features are important for building bioinformatics pipelines and performing analyses (especially since the tool seems to be ultra-fast).

7. Tables 1, 2, 3, and 4 (and the additional table in the supplementary material) state that "Compress / decompress times (as "ctime" / "dtime") are given in seconds," but no unit is given in the caption for cmemory and dmemory. Are these values in gigabytes?

8. The README should provide a small example for testing purposes, using the files already available in the repository or obtained via an efetch download (see below).

9. Reproducibility is hard to follow:

    I had to search for the following procedure to test the software:

    wget https://github.com/kowallus/mbgc/releases/download/v1.1/tested_samples_lists.7z

    7z e tested_samples_lists.7z

    After the cere download, also tar -vzxf cere_assemblies.tgz

Then I realized that the sequences were missing, and in the NCBI interface I lost track. I gave up after a few segmentation faults/combinations, without understanding whether the program or the settings caused the issue.

    Please provide supplementary material and README with the complete instructions to reproduce the experiments (the exact commands).

Also, this simple way to download a multi-FASTA file with Escherichia coli sequences may be helpful:

conda install -y -c conda-forge -c bioconda -c defaults entrez-direct

    esearch -db nucleotide -query "Escherichia coli" | efetch -format fasta > Escherichia.mfa