DivBrowse – interactive visualization and exploratory data analysis of variant call matrices

Abstract

Background

The sequencing of whole genomes is becoming increasingly affordable. In this context, large-scale sequencing projects are generating ever larger datasets of species-specific genomic diversity. As a consequence, more and more genomic data needs to be made easily accessible and analyzable to the scientific community.

Findings

We present DivBrowse, a web application for interactive visualization and exploratory analysis of genomic diversity data stored in Variant Call Format (VCF) files of any size. By seamlessly combining BLAST as an entry point together with interactive data analysis features such as principal component analysis in one graphical user interface, DivBrowse provides a novel and unique set of exploratory data analysis capabilities for genomic biodiversity datasets. The capability to integrate DivBrowse into existing web applications supports interoperability between different web applications. Built-in interactive computation of principal component analysis allows users to perform ad-hoc analysis of the population structure based on specific genetic elements such as genes and exons. Data interoperability is supported by the ability to export genomic diversity data in VCF and General Feature Format (GFF3) files.

Conclusion

DivBrowse offers a novel approach for interactive visualization and analysis of genomic diversity data and optionally also gene annotation data by including features like interactive calculation of variant frequencies and principal component analysis. The use of established standard file formats for data input supports interoperability and seamless deployment of application instances based on the data output of established bioinformatics pipelines.
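
As an illustration of the two computations named above (variant frequencies and principal component analysis on a chosen genomic region), the following is a minimal sketch and not DivBrowse's actual implementation. It assumes genotypes are encoded as alternate-allele counts (0, 1, 2) in a samples-by-variants NumPy array, with -1 marking missing calls; the function names, the mean imputation of missing calls, and the toy data are all hypothetical.

```python
import numpy as np

def alt_allele_frequencies(genotypes):
    """Alternate-allele frequency per variant for diploid calls (-1 = missing)."""
    g = np.array(genotypes, dtype=float)  # copy, so the caller's array is untouched
    g[g < 0] = np.nan
    # Mean alt-allele count per variant over called samples, divided by ploidy 2.
    return np.nanmean(g, axis=0) / 2.0

def pca_scores(genotypes, n_components=2):
    """Sample coordinates on the first principal components.

    Assumption: missing calls are imputed with the per-variant mean, and
    every variant has at least one called sample.
    """
    g = np.array(genotypes, dtype=float)
    g[g < 0] = np.nan
    col_means = np.nanmean(g, axis=0)
    g = np.where(np.isnan(g), col_means, g)
    centered = g - col_means
    # SVD of the centered matrix; U * S gives the per-sample PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]

# Toy data: 4 samples, 3 variants at hypothetical positions.
positions = np.array([1200, 1350, 9000])
genotypes = np.array([
    [0, 0, 2],
    [0, 1, 2],
    [2, 2, 0],
    [2, 1, -1],
])

# Restrict the analysis to a region of interest, e.g. one gene.
in_region = (positions >= 1000) & (positions <= 2000)
region = genotypes[:, in_region]

freqs = alt_allele_frequencies(genotypes)
scores = pca_scores(region, n_components=2)
```

Restricting the matrix to the columns inside a gene or exon before calling `pca_scores` is what makes the analysis specific to a genetic element; the PCA step itself is unchanged.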

Article activity feed

  1. Background

    Reviewer 2: Armin Scheben

    The authors present the web app DivBrowse for visualizing genomic variant data. Their code is publicly available, and their web app is well-documented and provides several demonstration implementations for human, mouse and barley. The manuscript is well-written and concisely covers the key features of DivBrowse and summarizes the implementation of the software.

    I was able to test the demonstration website and was impressed with how smoothly everything ran and was set up. Due to time constraints, I was not able to test the installation and setup of DivBrowse, but the documentation looks sufficient to allow easy setup by experts. Overall, I think this is a useful contribution to the community. One key issue I believe the authors should address, however, is that the manuscript presents DivBrowse in a vacuum, with little mention of or comparison with existing software with overlapping functionality. Below I provide some further details to illustrate my point and how it might be addressed, and I list several other minor comments.

    Main comment

    The authors rightly indicate in their introduction that the growing amounts of genomic data generated require robust solutions for visualization and exploration that do not require use of the command line. But the authors fail to mention that there exists a considerable ecosystem of software that already does this. Moreover, some of the software available offers substantially expanded features compared to DivBrowse.

    To help readers better decide when DivBrowse might be the right choice for their needs compared to other options, the authors could cite existing software and provide some comparison. My knowledge of all available software is not exhaustive, but Wang et al. 2020 (https://doi.org/10.1093/gigascience/giaa060) in their publication of SnpHub provide a comparison table including SnpHub itself and JBrowse. I would consider both of these tools for exploration and visualization of SNPs and additional data, similar to DivBrowse. JBrowse is relatively widely used and considerably more feature-rich. The standalone tool TASSEL (https://academic.oup.com/bioinformatics/article/23/19/2633/185151) also offers many options for offline visualisation, exploration and analysis of VCF data. There may also be other tools I am not aware of, and readers would likely benefit from a brief overview of the landscape, the pros and cons of each piece of software, and what differentiates DivBrowse.

    Minor comments

    The authors can consider the minor comments below as 'take it or leave it' comments. I do not think it is essential to address these, but in my view they may enhance the manuscript.

    1. In the discussion, the authors point out the efficiency and low latency of DivBrowse; however, this is not quantified in the manuscript. If it were technically feasible without substantial effort, it might be useful to quantify in some way just how efficient DivBrowse can be, especially if this could be one of the stand-out features of DivBrowse.

    2. The authors use divergence Bezier curves to increase the number of variant calls that can be visualized. This is helpful and a useful default. However, invariant sites can also be of considerable evolutionary and breeding/medicinal interest. When invariant sites are collapsed, they become indistinguishable from unmapped regions. This is a fundamental issue, and many VCF files may not encode information on invariant sites, so it may not be possible to develop robust functionality that allows users to optionally show invariant sites as well. Still, this point may be worth briefly mentioning in the discussion, if the authors agree it is noteworthy.

    3. One advantage of visualization of relatively raw data like SNPs is that it can reveal patterns that are less obvious in other types of data exploration. To fully take advantage of this, tools like JBrowse allow export of the browser window in SVG format, allowing users to incorporate images into high-resolution figures. I don't expect the authors to necessarily implement this feature for this review, but it may be worth adding it to the list of potential enhancements that could be implemented based on user demand.

  2. Background

    This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giad025), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Weilong Guo, PhD

    Patrick König and colleagues have built a web application for the interactive query, visualization and analysis of genomic diversity data, supporting population structure analysis on specific genetic elements, and data export. The application can also be easily used as a plugin for existing web applications. According to its documentation, this application can be easily installed from pip, Docker and conda, which would be useful for population genomic studies. There are still several concerns about this manuscript.

    Major concerns:

    1. In the SNP visualization, only a very limited number of SNPs can be viewed on the webpage at once, and there is no function such as "zoom in" or "zoom out" (it is suggested to add such functions or similar ones). Although the application can export almost all the SNP sites of a whole VCF file, it is far from user-friendly. It is suggested to add a chromosome track showing the genomic window being queried, allowing the cursor to select or adjust the genomic region (UCSC-browser style), which is necessary for an intuitive user experience.

    2. The BLAST function could serve as a useful entry point. But what is the starting position of the query sequence when it is mapped to the minus strand? The authors should explain this more clearly on the website.

    3. The authors mentioned that their application converts the input VCF file into Zarr format. Thus, a more thorough performance evaluation should be reported to show the advantages of this strategy (rather than using the VCF file directly).

    4. The authors should also compare their application with other similar existing web applications, such as CanvasDB, Gigwa, SNiPlay and SnpHub, to highlight its advantages and improvements.
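
To make the minus-strand question in major concern 2 concrete: in BLAST+ tabular output, a nucleotide hit on the minus strand is reported with a subject start coordinate greater than its subject end. The sketch below shows one way such a hit could be normalized to an ascending genomic interval plus a strand label before being used as a navigation target. How DivBrowse itself treats these coordinates is not stated in the reviews; the function name `normalize_blast_hit` is hypothetical.

```python
def normalize_blast_hit(sstart, send):
    """Return (start, end, strand) with start <= end.

    Assumes 1-based subject coordinates as in BLAST+ tabular output,
    where sstart > send indicates a minus-strand hit.
    """
    if sstart <= send:
        return sstart, send, "+"
    # Minus strand: swap so the interval is ascending on the reference.
    return send, sstart, "-"

plus_hit = normalize_blast_hit(100, 250)
minus_hit = normalize_blast_hit(250, 100)
```

With this convention, both hits resolve to the same genomic window, and the strand label records that the query's first base aligns at the right-hand end of the window for the minus-strand hit.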

    Minor concerns:

    1. The analysis functions are still insufficient. It is suggested to also support commonly used analysis tools or methods, such as haplotype analysis, STRUCTURE analysis, distribution of nucleotide diversity and selective sweep analysis.

    2. Ref. 22 is incomplete.