CompactTree: a lightweight header-only C++ library and Python wrapper for ultra-large phylogenetics
Curation statements for this article:
Curated by GigaByte
Editors Assessment:
As volumes of viral and bacterial sequence data grow exponentially, the field of computational phylogenetics now demands resources to manage the burgeoning scale of this input data. This study introduces CompactTree, a C++ library designed for ultra-large phylogenetic trees with millions of tips. To address these scalability issues while maintaining ease of incorporation into external code bases, CompactTree is a header-only library with enhanced performance, achieved through minimal dependencies, an optimized node representation, and memory-efficient tree structure schemes, resulting in significantly reduced memory footprints and improved processing times. Peer review requested more detail on the functionality and some real-world examples demonstrating the current utility of the tool. Although the tool primarily supports the (text-based) Newick format, its increased scalability and extensibility hold promise for multiple biological and epidemiological applications supporting more complex formats such as Nexus and NeXML. The tool is open source (GPLv3 licensed) and available on GitHub: https://niema.net/CompactTree
This evaluation refers to version 1 of the preprint
Abstract
The study of viral and bacterial species requires the ability to load and traverse ultra-large phylogenies with tens of millions of tips, but existing tree libraries struggle to scale to these sizes. We introduce CompactTree, a lightweight header-only C++ library with a user-friendly Python wrapper for traversing ultra-large trees that can be easily incorporated into other tools. We show that CompactTree is orders of magnitude faster and requires orders of magnitude less memory than existing tree packages. CompactTree is freely accessible as an open source project: https://github.com/niemasd/CompactTree
Abstract
Motivation: The study of viral and bacterial species requires the ability to load and traverse ultra-large phylogenies with tens of millions of tips, but existing tree libraries struggle to scale to these sizes.
Results: We introduce CompactTree, a lightweight header-only C++ library for traversing ultra-large trees that can be easily incorporated into other tools, and we show that it is orders of magnitude faster and requires orders of magnitude less memory than existing tree packages.
Availability: CompactTree can be accessed at: https://github.com/niemasd/CompactTree
Contact: niema{at}ucsd.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
This work has been published in GigaByte Journal under a CC-BY 4.0 license (https://doi.org/10.46471/gigabyte.152). These reviews (including a protocol review) are as follows.
Reviewer 1. Jeet Sukumaran
Is the documentation provided clear and user friendly? Yes. Excellent documentation. A pleasure to read. Are there (ideally real world) examples demonstrating use of the software? No.
Reviewer 2. Ziqi Deng
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. I was able to run all the tests and use the CompactTree C++ library correctly, but I encountered an issue when installing the Python wrapper via pip install CompactTree.
Are there (ideally real world) examples demonstrating use of the software? Yes. CompactTree provides examples of simulated trees for testing and comparison against peer packages. The paper also mentions its ability to load the Greengenes2 tree (~22M nodes). It would be great to see the test workflow for this so that users can verify it.
Additional Comments: CompactTree is aimed at a very specific task, that of loading large phylogenetic trees with millions of nodes. The results show that it is significantly faster than the other peer tools, not only in loading but also in traversing trees, with lower peak memory usage. It also includes the test workflow so that users can repeat the comparison with other peer tools.
Reviewer 3. Giorgio Bianchini
Is the language of sufficient quality? Yes. It is slightly confusing that the paper is written using plural pronouns ("We"), when there is a single author.
Is there a clear statement of need explaining what problems the software is designed to solve and who the target audience is? No. The statement of need is present; however, it does not clearly explain what kinds of problems the software will be able to solve, beyond generic statements about addressing scalability issues. The aims of the library should be explored in more detail: as noted by the author, this library offers great speed and efficiency, but at the cost of reduced flexibility and functionality compared to other tools. Speed and efficiency are always good things, but what does the library actually do? A very fast library that does nothing is not particularly useful. So, what specific analyses does CompactTree allow, that would be impractical using other tools? For example, they could select a case study from the literature, where the analyses were limited by the algorithm, and use their library to extend the analysis to a larger dataset. The author mentions clustering, ancestral state reconstruction, and transmission risk prediction as examples of analyses that involve tree traversals, so they could start here (although I am not convinced that the efficiency of the tree representation is the computational bottleneck in these cases). The results should also be briefly mentioned in the abstract. Furthermore, the author mentions a number of packages used to analyse trees, but these are all Python packages. Since CompactTree is presented as a C++ library, this seems odd; other tools and programming languages should be mentioned/compared. For example, “ape” and “phytools” are very popular R packages, while “Bio++” is another C++ library; a literature review (or a simple web search) may reveal other such libraries. Also, the reference given for bp (“[4]”) is incorrect.
Is installation/deployment sufficiently outlined in the paper and documentation, and does it proceed as outlined? Yes. Everything works fine if the header is included in a single source file, but if multiple distinct files contain the #include statement, a compilation error will occur due to the multiple definitions. In a real-world application, the library would reasonably need to be included in multiple source files, so this should be fixed.
Is the documentation provided clear and user friendly? Yes. The documentation "Cookbook" is very nicely organised.
Have any claims of performance been sufficiently tested and compared to other commonly-used packages? No. While the author compares CompactTree to a number of Python packages, no comparison is made against tools that use other programming languages. In particular, the author states that there is no C++ library for loading and traversing phylogenetic trees; however, as I mentioned, at least Bio++ exists and appears to be reasonably well cited. Furthermore, the memory plot does not consider the baseline memory usage. This is evident in the first two datapoints (n=100 and n=1000) for each tool, which show a very small difference, despite the leaf count increasing by an order of magnitude. If the first datapoint is subtracted from all subsequent datapoints, the memory plot looks quite similar to the other plots. If you re-run the benchmarks to include other tools, I would suggest including a “control” datapoint with a very small n (or even, loading the library without opening a tree), and subtracting this from all other datapoints; this will provide an estimate of the memory actually used to load the trees.
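The baseline correction suggested above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `subtract_baseline` is hypothetical, and every number below is a made-up placeholder chosen to show the effect, not a measurement from the paper or from any tool.

```python
def subtract_baseline(peak_kb, baseline_kb):
    """Remove a fixed baseline (e.g. peak memory with the library loaded
    but no tree opened) from each measured peak, clamping at zero."""
    return {n: max(peak - baseline_kb, 0) for n, peak in peak_kb.items()}


# HYPOTHETICAL placeholder values in KB, keyed by leaf count n.
# They are not benchmark results; they only illustrate the arithmetic.
measured = {100: 50_010, 1_000: 50_100, 10_000: 51_000, 100_000: 60_000}
baseline = 50_000  # hypothetical "control" run: library loaded, no tree

corrected = subtract_baseline(measured, baseline)

for n in measured:
    print(f"n={n}: raw={measured[n]} KB, tree-only={corrected[n]} KB")
```

With these placeholder values, the raw peaks for n=100 and n=1000 differ by well under 1%, while the corrected values differ by a factor of ten, which is the pattern the reviewer describes: the fixed overhead hides the scaling until it is subtracted.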
Are there (ideally real world) examples demonstrating use of the software? No. As I mentioned above, having at least one example demonstrating an analysis that is significantly improved by the use of this library would be beneficial. Discussion of the improvements should also consider usability trade-offs in a real-world scenario.
Is automated testing used or are there manual steps described so that the functionality of the software can be verified? No.
Additional Comments: The library looks promising and is reasonably well documented; the only two things that are really missing are a real-world practical application and a comparison with other relevant alternatives (especially Bio++). A large portion of the manuscript is spent describing how the library could be improved, rather than what it can currently do. This could be summarised in just one or two sentences, thus leaving more space for describing the real-world example.