Taxonium, a web-based tool for exploring large phylogenetic trees

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    Sanderson developed novel interactive software for visualizing phylogenetic trees representing millions of sequences. This is a fundamental advance over previous software that is typically limited to trees with a few thousand tips. Taxonium has been used intensively by the virus evolution community over the past months and has thus already proven its utility and performance.

Abstract

The COVID-19 pandemic has resulted in a step change in the scale of sequencing data, with more genomes of SARS-CoV-2 having been sequenced than any other organism on earth. These sequences reveal key insights when represented as a phylogenetic tree, which captures the evolutionary history of the virus, and allows the identification of transmission events and the emergence of new variants. However, existing web-based tools for exploring phylogenies do not scale to the size of datasets now available for SARS-CoV-2. We have developed Taxonium, a new tool that uses WebGL to allow the exploration of trees with tens of millions of nodes in the browser for the first time. Taxonium links each node to associated metadata and supports mutation-annotated trees, which are able to capture all known genetic variation in a dataset. It can either be run entirely locally in the browser, from a server-based backend, or as a desktop application. We describe insights that analysing a tree of five million sequences can provide into SARS-CoV-2 evolution, and provide a tool at cov2tree.org for exploring a public tree of more than five million SARS-CoV-2 sequences. Taxonium can be applied to any tree, and is available at taxonium.org, with source code at github.com/theosanderson/taxonium.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    The software presented in this paper is well documented and represents a significant achievement that breaks new ground in terms of what is possible to render and explore in the web browser. This tool is essential for the exploration of SC2 data, but equally useful for the tree of life and other tree-like data sets.

    Thank you for reviewing my work and for this generous assessment.

    Reviewer #2 (Public Review):

    This manuscript describes a web-based tool (Taxonium) for interactively visualizing large trees that can be annotated with metadata. Having worked on similar problems in the analysis and visualization of enormous SARS-CoV-2 data sets, I am quite impressed with the performance and "look and feel" of the Taxonium-powered cov2tree web interface, particularly its speed at rendering trees (or at least a subgraph of the tree).

    Thank you for the kind words.

    The manuscript is written well, although it uses some technical "web 2.0" terminology that may not be accessible to a general scientific readership, e.g., "protobuf" (presumably protocol buffer) and "autoscaling Kubernetes cluster". The latter is like referring to a piece of lab equipment, so the author should provide some sort of reference to the manufacturer, i.e., https://kubernetes.io/.

    Thank you for flagging this. I have now replaced the colloquial "protobuf" with "protocol buffer" and provided a URL for Kubernetes. It is always difficult to judge how much to explain technical terms. I certainly agree that many people will be unfamiliar with, for instance, protocol buffers, but an explanation of what they are (which may not be particularly important for understanding Taxonium) can sometimes overshadow more important details. So my preference in that particular case is for an interested reader to research the unfamiliar term.

    In other respects, the manuscript lacks some methodological details, such as exactly how the tree is "sparsified" to reduce the number of branches being displayed for a given range of coordinates.

    This is an important point also raised by Reviewer 3. I have added a new section in the Materials and Methods which discusses this in some detail.

    Some statements are inaccurate or not supported by current knowledge in the field. For instance, it is not true that the phylogeny "closely approximates" the transmission tree for RNA viruses.

    I agree that this was an overly broad claim, and have softened it, now saying:

    "The fundamental representation of a viral epidemic for genomic epidemiology is a phylogenetic tree, which approximates the transmission tree and can allow insights into the direction of migration of viral lineages."

    Mutations are not associated with a "point in the phylogeny", but rather the branch that is associated with that internal node.

    I have changed this as suggested.
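
    To illustrate the distinction, here is a minimal sketch, not taken from the Taxonium codebase, of a mutation-annotated tree in which each mutation is stored on the branch leading into a node rather than at a point in the tree. The `Node` class, its field names, and the mutation labels are hypothetical and chosen only for illustration. Under this representation, the number of independent times a mutation has arisen is the number of branches that carry it.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node; `branch_mutations` belong to the branch leading into it."""
    name: str
    branch_mutations: List[str] = field(default_factory=list)  # hypothetical field
    children: List["Node"] = field(default_factory=list)


def count_independent_occurrences(root: Node, mutation: str) -> int:
    """Count the branches on which `mutation` is recorded."""
    count = 0
    stack = [root]  # iterative traversal avoids recursion limits on deep trees
    while stack:
        node = stack.pop()
        if mutation in node.branch_mutations:
            count += 1
        stack.extend(node.children)
    return count


# Toy tree: the illustrative label "S:E484K" sits on two separate branches.
tip_a = Node("tip_a", ["S:E484K"])
tip_b = Node("tip_b")
tip_c = Node("tip_c")
tip_d = Node("tip_d")
clade_1 = Node("clade_1", ["S:N501Y"], [tip_a, tip_b])
clade_2 = Node("clade_2", ["S:E484K"], [tip_c, tip_d])
root = Node("root", [], [clade_1, clade_2])

occurrences = count_independent_occurrences(root, "S:E484K")
```

    In this toy tree the label appears once on a tip branch and once on an internal branch, so it is counted as two independent events even though three tips descend from branches that carry it.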

    A major limitation of displaying a single phylogenetic tree (albeit an enormous one) is that the uncertainty in reconstructing specific branches is not readily communicated to the user. This problem is exacerbated for large trees where the number of observations far exceeds the amount of data (alignment length). Hence, it would be very helpful to have some means of annotating the tree display with levels of uncertainty, e.g., "we actually have no idea if this is the correct subtree". DensiTree endeavours to do this by drawing a joint representation of a posterior sample of trees, but it would be onerous to map a user interface to this display. I'm raising this point as something for the developers to consider as a feature addition, and not a required revision for this manuscript.

    I entirely agree with this point. I have added a sentence in the discussion:

    "Even where sequences are accurate, phylogenetic topology is often uncertain, and finding ways to communicate this at scale, building on prior work [Densitree citation] would be valuable."

    The authors make multiple claims of novelty - e.g., "[...] existing web-based tools [...] do not scale to the size of data sets now available for SARS-CoV-2" and "Taxonium is the only tool that readily displays the number of independent times a given mutation has occurred [...]" - that are not entirely accurate. For example, RASCL (https://observablehq.com/@aglucaci/rascl) allows users to annotate phylogenies to examine the repeated occurrence of specific mutations. Our own system, CoVizu, also enables users to visualize and explore the evolutionary relationships among millions of SARS-CoV-2 genomes, although it takes a very different approach from Taxonium. Taxonium is an excellent and innovative tool, and it should not be necessary to claim priority.

    I agree that comparisons with existing tools are difficult and often provide a sense of unnecessary competition. I attempted to be quite careful in the specific section focused on comparison, but may have been less careful earlier on. The intent with this first sentence in the abstract was to provide a succinct description of the gap that Taxonium was developed to fill with "however, existing web-based tools for analysing and exploring phylogenies do not scale to the size of datasets now available for SARS-CoV-2". I have now removed the words "analysing and", focusing on the exploration of phylogenies. I think this new sentence is defensible in that valuable tools such as CoVizu intentionally do not explore a phylogeny directly but instead take a multi-level approach, and this new sentence better matches the comparisons in the paper. In the second sentence, I have removed the phrase "is the only tool that", which I agree adds little and may not be accurate, depending on one's interpretation of "readily". Thank you for these points.

    Although the source code (largely JavaScript with some Python) is quite clean and has a consistent style, there is a surprising lack of documentation in the code. This makes me concerned about whether Taxonium can be a maintainable and extensible open-source project since this complex system has been almost entirely written by a single developer. For example, usher_to_taxonium.py has a single inline comment (a command-line example) and no docstring for the main function. JBrowsePanel.jsx has a single inline comment for 293 lines of code. There is some external documentation (e.g., DEVELOPMENT.md) that provides instructions for installing a development build, but it would be very helpful to extend this documentation to describe the relationships among the different files and their specific roles. Again, this is something for the developers to consider for future work and not the current manuscript.

    This is an entirely fair comment. The version of Taxonium presented in the manuscript is "2.0", which is a new version built from scratch with considerably less technical debt than the version that preceded it. Its main technical strength is that, with the exception of the backend, it is relatively well modularised into functional components. However, the limitations that the reviewer notes with respect to commenting are real. In the time since this manuscript was submitted, several important features have been added by an external collaborator, Alex Kramer, most notably the Treenome Browser (https://www.biorxiv.org/content/10.1101/2022.09.28.509985v1). I hope that the ability of Alex to add these features with little need for support provides some evidence of Taxonium's extensibility. But I acknowledge there is room for improvement.

    Reviewer #3 (Public Review):

    The paper succinctly provides an overview of the current approaches to generating and displaying super-large phylogenies (>10,000 tips). The results presented here provide a comprehensive set of tools to address the display and exploration of such phylogenies. The tools are well-described and comprehensive, and additional online documentation is welcome.

    The technical work to display such large datasets in a responsive fashion is impressive and this is aptly described in the paper. The author rightly decides that displaying large phylogenies is not simply a matter of rendering "more nodes", and so in my eyes, the major advancement is the approach used to downsample trees on-the-fly so that the number of nodes displayed at one time is manageable. This is detailed only briefly (Results section, 1st paragraph, 2 sentences). I would like to see more discussion about the details of this approach.

    Thank you for this point, also raised by Reviewer 2. I have now added a lengthy section on this in the Materials and Methods, which I hope is helpful. The approach is not especially sophisticated, but it does the job and runs quickly.

    Examples that came up while exploring the tool: the (well implemented) search functionality reports results from the entire tree (e.g. in Figure 4, the number of red circles is not a function of zoom level), how does this interact with a tree showing only a subset of nodes?

    Yes, this is an important feature which I perhaps did not do justice to in the write-up. I have included in the new section in the Materials and Methods a paragraph discussing search results:

    "In order to ensure that search results are always comprehensive, but at the same time to avoid overplotting, we take the following approach:

    ● Searches are performed across every single node on the tree to select a set of nodes that match the search. The total number of matches is displayed in the client.

    ● If fewer than 10,000 matches are detected, these are simply displayed in the client as circles.

    ● If more than 10,000 matches are detected, the results are sparsified using the method above, and then displayed.

    ● Upon zooming or panning, the sparsification is repeated for the new bounding box."
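
    The steps quoted above can be sketched roughly as follows. This is an illustration under stated assumptions rather than Taxonium's code: the function name, the node record layout, and the `sparsify` callable are all hypothetical, and the sparsification method itself is passed in rather than reproduced, since it is described in the manuscript and not here.

```python
from typing import Any, Callable, Dict, List, Tuple

Node = Dict[str, Any]                      # hypothetical node record, e.g. {"name", "x", "y"}
BBox = Tuple[float, float, float, float]   # (min_x, max_x, min_y, max_y)
Sparsifier = Callable[[List[Node], BBox], List[Node]]

MATCH_THRESHOLD = 10_000  # threshold quoted in the text


def search_results_to_draw(
    nodes: List[Node],
    matches: Callable[[Node], bool],
    bbox: BBox,
    sparsify: Sparsifier,
    threshold: int = MATCH_THRESHOLD,
) -> Tuple[int, List[Node]]:
    """Return (total match count, matches to draw) for the current view."""
    # The search always covers every node, so the reported total is comprehensive.
    matched = [node for node in nodes if matches(node)]
    total = len(matched)
    if total < threshold:
        # Few enough matches: draw them all as circles.
        return total, matched
    # Assumption: the text does not say what happens at exactly `threshold`
    # matches; this sketch sparsifies in that case too.
    return total, sparsify(matched, bbox)
```

    Calling the function again with the new bounding box after each zoom or pan corresponds to the final step. A real client would probably cache the matched set between calls rather than repeating the search, but that is omitted here for brevity.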

    How is the node order chosen with regards to "nodes that would be hidden by other nodes are excluded" and could this affect interpretations depending on the colouring used?

    This perhaps was slightly sloppy language which did not directly describe the implementation. I have now rephrased this to "only nodes that overlap other nodes are excluded", as we don't in fact consider a notion of z-index when doing this. The way the sparsification works (now better described) means that the nodes excluded are determined essentially by position and I don't foresee this introducing particular biases, but this was an insightful point to raise.
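
    As a rough illustration of what exclusion "determined essentially by position" could look like, the sketch below snaps each node inside the bounding box to a cell of a coarse grid and keeps one node per cell. This grid-based scheme is an assumption made for illustration and is not claimed to be the algorithm described in the Materials and Methods; the function name, parameters, and node layout are hypothetical.

```python
from typing import Any, Dict, List, Mapping, Tuple

BBox = Tuple[float, float, float, float]  # (min_x, max_x, min_y, max_y)


def sparsify_by_position(
    nodes: List[Mapping[str, Any]],
    bbox: BBox,
    cells_x: int = 100,
    cells_y: int = 100,
) -> List[Mapping[str, Any]]:
    """Keep at most one node per grid cell within the bounding box."""
    min_x, max_x, min_y, max_y = bbox
    width = (max_x - min_x) or 1.0    # guard against a zero-width box
    height = (max_y - min_y) or 1.0
    kept: Dict[Tuple[int, int], Mapping[str, Any]] = {}
    for node in nodes:
        x, y = node["x"], node["y"]
        if not (min_x <= x <= max_x and min_y <= y <= max_y):
            continue  # outside the current view
        # Map the position to a grid cell; clamp so the far edge stays in range.
        cell_x = min(int((x - min_x) / width * cells_x), cells_x - 1)
        cell_y = min(int((y - min_y) / height * cells_y), cells_y - 1)
        # The first node seen in a cell is kept; later overlapping nodes are dropped.
        kept.setdefault((cell_x, cell_y), node)
    return list(kept.values())
```

    In this sketch, which of two overlapping nodes survives depends only on their positions and on input order, not on colour or any z-index. Zooming in shrinks the bounding box, so nodes that shared a cell at one zoom level fall into separate cells and reappear.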

    Taxonium takes the approach of displaying all available data (sparsification of nodes notwithstanding). Biases in the generation of sequences, especially geographical, will therefore be present (especially so in the two main datasets discussed here - SARS-CoV-2 and monkeypox). This caveat should be made explicit.

    This is certainly true. I have added this new paragraph in the Discussion:

    "A further challenge is the vastly different densities of sampling in different geographic regions. Because Cov2Tree does not downsample sequences from countries which are able to sequence a greater proportion of their cases, the number of tips on a tree is not indicative of the size of an outbreak and in some cases even inferences of the directionality of migration may be confounded. There would be value in the development of techniques that allow visual normalisation of trees for sampling biases, which might allow for less biased phylogenetic representations without downsampling."

    Has the author considered choosing which nodes to exclude for sparsified trees in such a way as to minimise known sampling biases?

    The last sentence of the new paragraph above alludes to a sort-of-similar approach. I hadn't directly considered the approach the reviewer suggests. It is an interesting idea. The downsampling algorithm has to be very computationally inexpensive but it would be interesting to explore ways to do this. I am tracking this in https://github.com/theosanderson/taxonium/issues/437.

    Interoperability between different software tools is discussed in a technical sense but not as it pertains to discovering the questions to ask of the data. As an example, spotting the specific mutations shown in figure 3 + 4 is not feasible by checking every position iteratively; instead, the ability to have mutations flagged elsewhere and then seamlessly explore them in Taxonium is a much more powerful workflow. This kind of interoperability (which Taxonium supports) enhances the claim of "providing insights into the evolution of the virus".

    Thank you for flagging this point -- I am very excited by the growing ecosystem of interoperable tools. You are absolutely right that most of the insights Taxonium can bring into evolution rely also on this broader ecosystem. I have added a florid sentence in the concluding paragraph: "It forms part of an ecosystem of open-source tools that together turn an avalanche of sequencing data into actionable insights into ongoing evolution."

    The prosaic reason I don't discuss Taxonium's interoperability features in more detail in this manuscript is that it aims to describe the version of Taxonium I initially developed, and these features were developed collaboratively by a broader group later on (and after deposition of this preprint).

    Taxonium has been a fantastic resource for the analysis of SARS-CoV-2 and this paper fluently presents the tool in the context of the wider ecosystem of bioinformatic tools in use today, with the interoperability of the different pieces being a welcome direction.

  2. eLife assessment

    Sanderson developed novel interactive software for visualizing phylogenetic trees representing millions of sequences. This is a fundamental advance over previous software that is typically limited to trees with a few thousand tips. Taxonium has been used intensively by the virus evolution community over the past months and has thus already proven its utility and performance.

  3. Reviewer #1 (Public Review):

    The software presented in this paper is well documented and represents a significant achievement that breaks new ground in terms of what is possible to render and explore in the web browser. This tool is essential for the exploration of SC2 data, but equally useful for the tree of life and other tree-like data sets.

  4. Reviewer #2 (Public Review):

    This manuscript describes a web-based tool (Taxonium) for interactively visualizing large trees that can be annotated with metadata. Having worked on similar problems in the analysis and visualization of enormous SARS-CoV-2 data sets, I am quite impressed with the performance and "look and feel" of the Taxonium-powered cov2tree web interface, particularly its speed at rendering trees (or at least a subgraph of the tree).

    The manuscript is written well, although it uses some technical "web 2.0" terminology that may not be accessible to a general scientific readership, e.g., "protobuf" (presumably protocol buffer) and "autoscaling Kubernetes cluster". The latter is like referring to a piece of lab equipment, so the author should provide some sort of reference to the manufacturer, i.e., https://kubernetes.io/. In other respects, the manuscript lacks some methodological details, such as exactly how the tree is "sparsified" to reduce the number of branches being displayed for a given range of coordinates. Some statements are inaccurate or not supported by current knowledge in the field. For instance, it is not true that the phylogeny "closely approximates" the transmission tree for RNA viruses. Mutations are not associated with a "point in the phylogeny", but rather the branch that is associated with that internal node.

    A major limitation of displaying a single phylogenetic tree (albeit an enormous one) is that the uncertainty in reconstructing specific branches is not readily communicated to the user. This problem is exacerbated for large trees where the number of observations far exceeds the amount of data (alignment length). Hence, it would be very helpful to have some means of annotating the tree display with levels of uncertainty, e.g., "we actually have no idea if this is the correct subtree". DensiTree endeavours to do this by drawing a joint representation of a posterior sample of trees, but it would be onerous to map a user interface to this display. I'm raising this point as something for the developers to consider as a feature addition, and not a required revision for this manuscript.

    The authors make multiple claims of novelty - e.g., "[...] existing web-based tools [...] do not scale to the size of data sets now available for SARS-CoV-2" and "Taxonium is the only tool that readily displays the number of independent times a given mutation has occurred [...]" - that are not entirely accurate. For example, RASCL (https://observablehq.com/@aglucaci/rascl) allows users to annotate phylogenies to examine the repeated occurrence of specific mutations. Our own system, CoVizu, also enables users to visualize and explore the evolutionary relationships among millions of SARS-CoV-2 genomes, although it takes a very different approach from Taxonium. Taxonium is an excellent and innovative tool, and it should not be necessary to claim priority.

    Although the source code (largely JavaScript with some Python) is quite clean and has a consistent style, there is a surprising lack of documentation in the code. This makes me concerned about whether Taxonium can be a maintainable and extensible open-source project since this complex system has been almost entirely written by a single developer. For example, `usher_to_taxonium.py` has a single inline comment (a command-line example) and no docstring for the main function. `JBrowsePanel.jsx` has a single inline comment for 293 lines of code. There is some external documentation (e.g., `DEVELOPMENT.md`) that provides instructions for installing a development build, but it would be very helpful to extend this documentation to describe the relationships among the different files and their specific roles. Again, this is something for the developers to consider for future work and not the current manuscript.

  5. Reviewer #3 (Public Review):

    The paper succinctly provides an overview of the current approaches to generating and displaying super-large phylogenies (>10,000 tips). The results presented here provide a comprehensive set of tools to address the display and exploration of such phylogenies. The tools are well-described and comprehensive, and additional online documentation is welcome.

    The technical work to display such large datasets in a responsive fashion is impressive and this is aptly described in the paper. The author rightly decides that displaying large phylogenies is not simply a matter of rendering "more nodes", and so in my eyes, the major advancement is the approach used to downsample trees on-the-fly so that the number of nodes displayed at one time is manageable. This is detailed only briefly (Results section, 1st paragraph, 2 sentences). I would like to see more discussion about the details of this approach. Examples that came up while exploring the tool: the (well implemented) search functionality reports results from the entire tree (e.g. in Figure 4, the number of red circles is not a function of zoom level), how does this interact with a tree showing only a subset of nodes? How is the node order chosen with regards to "nodes that would be hidden by other nodes are excluded" and could this affect interpretations depending on the colouring used?

    Taxonium takes the approach of displaying all available data (sparsification of nodes notwithstanding). Biases in the generation of sequences, especially geographical, will therefore be present (especially so in the two main datasets discussed here - SARS-CoV-2 and monkeypox). This caveat should be made explicit. Has the author considered choosing which nodes to exclude for sparsified trees in such a way as to minimise known sampling biases?

    Interoperability between different software tools is discussed in a technical sense but not as it pertains to discovering the questions to ask of the data. As an example, spotting the specific mutations shown in figure 3 + 4 is not feasible by checking every position iteratively; instead, the ability to have mutations flagged elsewhere and then seamlessly explore them in Taxonium is a much more powerful workflow. This kind of interoperability (which Taxonium supports) enhances the claim of "providing insights into the evolution of the virus".

    Taxonium has been a fantastic resource for the analysis of SARS-CoV-2 and this paper fluently presents the tool in the context of the wider ecosystem of bioinformatic tools in use today, with the interoperability of the different pieces being a welcome direction.