NETMAGE: a humaN-disEase phenoType MAp GEnerator for the Visualization of PheWAS

Abstract

Summary

Given genetic associations from a PheWAS, a disease-disease network can be constructed where nodes represent phenotypes and edges represent shared genetic associations between phenotypes. To improve the accessibility of the visualization of shared genetic components across phenotypes, we developed the humaN-disEase phenoType MAp GEnerator (NETMAGE), a web-based tool that produces interactive phenotype network visualizations from summarized PheWAS results. Users can search the map by a variety of attributes, and they can select nodes to view information such as related phenotypes, associated SNPs, and other network statistics. As a test case, we constructed a network using UK BioBank PheWAS summary data. By examining the associations between phenotypes in our map, we can potentially identify novel instances of pleiotropy, where loci influence multiple phenotypic traits. Thus, our tool provides researchers with a means to identify prospective genetic targets for drug design, contributing to the exploration of personalized medicine.
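The construction described above, where phenotypes become nodes and shared associations become edges, can be illustrated with a minimal sketch. This is not NETMAGE's actual implementation: the function name `build_ddn`, the tuple layout of the input, and the `5e-8` significance threshold are assumptions chosen for illustration, and the rows are made-up data.

```python
from collections import defaultdict
from itertools import combinations


def build_ddn(associations, p_threshold=5e-8):
    """Project phenotype-SNP associations onto a phenotype-phenotype network.

    associations: iterable of (phenotype, snp, p_value) tuples (assumed layout).
    Returns {(pheno_a, pheno_b): set of shared SNPs} with pheno_a < pheno_b.
    """
    # Group phenotypes by the SNPs they are significantly associated with.
    phenos_by_snp = defaultdict(set)
    for phenotype, snp, p_value in associations:
        if p_value <= p_threshold:
            phenos_by_snp[snp].add(phenotype)

    # Every pair of phenotypes that shares a SNP gets an edge; the edge keeps
    # the SNPs that produced it, so its weight is simply len(shared_snps).
    edges = defaultdict(set)
    for snp, phenos in phenos_by_snp.items():
        for a, b in combinations(sorted(phenos), 2):
            edges[(a, b)].add(snp)
    return dict(edges)


# Made-up example rows, not real PheWAS results.
rows = [
    ("type 2 diabetes", "rs1", 1e-12),
    ("obesity", "rs1", 3e-9),
    ("obesity", "rs2", 2e-10),
    ("hypertension", "rs2", 4e-8),
    ("hypertension", "rs3", 0.02),  # not significant, so it is ignored
]
ddn = build_ddn(rows)
```

Keeping the set of shared SNPs on each edge, rather than only a count, preserves the link between an edge and the variants behind it.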

Availability and implementation

Our service runs at https://hdpm.biomedinfolab.com . Source code can be downloaded at https://github.com/dokyoonkimlab/netmage .

Contact

dokyoon.kim@pennmedicine.upenn.edu

Supplementary information

Supplementary data and user guide are available at Bioinformatics online.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giac002), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 3: Yaomin Xu

    The authors present NETMAGE, a web tool that produces interactive, network-based visualizations of cross-phenotype disease relationships from PheWAS summary statistics. NETMAGE lets users search by various attributes and select nodes to view related phenotypes, associated SNPs, and various network statistics. As a use case, the authors used NETMAGE to construct a network from UK BioBank (UKBB) PheWAS summary statistics. The stated purpose of the tool is to provide a holistic, network-based view for an intuitive understanding of the relationships between disease phenotypes and to help analyze their shared genetic etiology.

    Major comments:

    A disease-disease network (DDN) based on true genetic associations is useful for understanding complex disease comorbidities and their shared genetic etiology (pleiotropy). An interactive web tool for exploring such complex networked information could be highly useful for the purposes the authors propose. However, EHR/biobank PheWAS associations are statistical in nature and commonly have small effect sizes. The reported genetic associations are often not well understood at the mechanistic level, and many are spurious. Although certain positive findings can be observed in the disease network generated by NETMAGE, the general usability of the current implementation for novel applications in drug design and personalized medicine is a concern, because those applications require the genetic associations to represent the true underlying causal mechanisms as closely as possible. Further work is needed to verify the genetic associations reported from PheWAS and to minimize the impact of spurious associations. Defining network edges by shared SNPs without considering the linkage disequilibrium (LD) between SNPs is misleading and could miss a significant portion of the associations that would link diseases if LD correlations were considered. The LD correlation between SNPs should be considered when constructing the network with NETMAGE.
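To make the LD concern concrete, the sketch below compares exact SNP overlap with overlap at the level of LD blocks. It is only an illustration: the `ld_block_of` mapping, the block labels, the rsIDs, and the phenotype names are all hypothetical, and in practice such a mapping would have to come from an LD reference panel.

```python
def shared_loci(snps_a, snps_b, ld_block_of):
    """Return the LD blocks tagged by both phenotypes.

    ld_block_of maps each SNP to an LD-block label (hypothetical input);
    SNPs missing from the map are treated as their own single-SNP block.
    """
    blocks_a = {ld_block_of.get(s, s) for s in snps_a}
    blocks_b = {ld_block_of.get(s, s) for s in snps_b}
    return blocks_a & blocks_b


# Hypothetical LD structure: rs10 and rs11 are in strong LD with each other.
ld_block_of = {"rs10": "block1", "rs11": "block1", "rs20": "block2"}
asthma = {"rs10", "rs20"}
eczema = {"rs11"}

exact_overlap = asthma & eczema                            # no shared SNP ID
locus_overlap = shared_loci(asthma, eczema, ld_block_of)   # shared LD block
```

With exact matching the two phenotypes would not be connected, whereas matching on LD blocks recovers an edge through `block1`.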

    For the reported DDN and its statistics to be relevant to true disease-disease relationships, the quality of disease diagnosis using Phecodes should be considered. Phecodes are based on ICD codes, which are known to be noisy. ICD code accuracy can be as low as 50%. Ignoring this limitation and treating disease diagnoses from Phecodes as gold standards, or as precise and accurate, may result in irrelevant and misleading findings.

    Phecodes are hierarchical. For example, parent codes are three digits (008), and each additional digit after the decimal point indicates a subset of the ICD codes of the parent code (008.5 and 008.52). So a code of 008.52 implies 008.5 and also 008. What is the impact of this hierarchy on the NETMAGE network and on the inferences to be made from the network?
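The hierarchy described above can be made explicit with a small helper that lists the parent codes implied by a Phecode. This is a minimal sketch that relies only on the digit convention stated in the comment; the function names are hypothetical.

```python
def phecode_ancestors(code):
    """Return the parent codes implied by a Phecode, nearest parent first."""
    root, _, decimals = code.partition(".")
    ancestors = []
    # Drop one decimal digit at a time: 008.52 -> 008.5
    for n in range(len(decimals) - 1, 0, -1):
        ancestors.append(f"{root}.{decimals[:n]}")
    # A code with any decimal digits also implies its three-digit root.
    if decimals:
        ancestors.append(root)
    return ancestors


def is_hierarchical_pair(code_a, code_b):
    """True if one code is an ancestor of the other."""
    return code_a in phecode_ancestors(code_b) or code_b in phecode_ancestors(code_a)


implied = phecode_ancestors("008.52")
```

A check such as `is_hierarchical_pair` could be used to flag edges that connect a code to its own parent, since such edges may reflect the coding scheme rather than shared biology.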

    Minor comments:

    On Page 9, you said "Out of the 2189 edges for which phi correlations could be calculated, 1811 (82.73%) appeared in the DDN. This behavior suggests that our genetic associations identified by our PheWAS results serve as a reasonable approximation of disease co-occurrences".

    This is expected because both the phi correlation and the PheWAS analyses were performed on the same dataset. If a pair of diseases co-occurs frequently in the dataset, you would expect a strong correlation between their genetic associations analyzed on the same dataset. However, the conclusion that genetic associations from PheWAS are a reasonable approximation of disease co-occurrences may not generalize. The disease-SNP relationships from the PheWAS analysis are bipartite. Even though NETMAGE focuses on the projected disease-disease network, the information about how specific SNPs link to their corresponding disease pairs is important. For example, in your UKBB-based network (https://hdpm.biomedinfolab.com/ddn/ukbb), when a specific disease is selected, a subgraph of the selected disease and the diseases linked to it is shown, but only a lump of SNPs is provided, without linking them to their specific disease pairs. This is not helpful. Annotating those SNPs with their genetic context could also be very useful for users to quickly grasp the nature of the genetic associations in the subgraph.
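For reference, the phi correlation between two binary diagnoses is computed from their 2x2 co-occurrence table. The sketch below is a generic implementation of that formula on made-up data, not the authors' code; returning `None` for a degenerate table is an assumption about how pairs "for which phi correlations could be calculated" might be distinguished from the rest.

```python
from math import sqrt


def phi_coefficient(x, y):
    """Phi correlation between two equal-length binary (0/1) sequences.

    Returns None when a margin of the 2x2 table is zero, since phi is
    undefined in that case.
    """
    n11 = sum(1 for a, b in zip(x, y) if a and b)
    n10 = sum(1 for a, b in zip(x, y) if a and not b)
    n01 = sum(1 for a, b in zip(x, y) if not a and b)
    n00 = sum(1 for a, b in zip(x, y) if not a and not b)
    denom = sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return None
    return (n11 * n00 - n10 * n01) / denom


# Made-up diagnosis indicators for eight individuals.
disease_a = [1, 1, 1, 0, 0, 0, 0, 0]
disease_b = [1, 1, 0, 0, 0, 0, 0, 1]
phi = phi_coefficient(disease_a, disease_b)  # (2*4 - 1*1) / 15
```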

  2. Reviewer 2: Dongjun Chung

    In this paper, the authors developed the humaN-disEase phenoType MAp GEnerator (NETMAGE), a web-based tool that produces interactive disease-disease network visualizations based on PheWAS summary statistics. The tool proposed in this manuscript has important implications and utility for biological and clinical studies. The manuscript is also well written overall and clearly describes NETMAGE. However, there are still some aspects that I hope the authors will address. I provide my comments in detail below.

    Major comments:

    1. I tried the web interface Human-Disease Phenotype Map (https://hdpm.biomedinfolab.com), which utilizes NETMAGE. I found that the network sometimes takes a while to appear. While the network is loading, only gray empty space and the side panel are shown. I recommend that the authors show a progress bar while the network loads, especially on first load, so that users do not think their web browser has frozen.

    2. In the Search bar, it is not always trivial to guess what to enter, especially for Phenotype Name, Associated SNPs, and Category. Auto-completion for these fields would make the tool considerably more convenient.

    3. The meaning of the edges is somewhat unclear to me. Are the existence and the weights of edges based purely on the number of shared SNPs, or are they based on a statistical method?

    4. When the weights of edges are calculated, are the marginal counts taken into account? The same number of shared SNPs can have different meanings when the disease to which this edge is connected has a small number of associated SNPs vs. a large number of associated SNPs. How is this factor considered?

    5. The network generated by the Human-Disease Phenotype Map (https://hdpm.biomedinfolab.com) is usually huge and complex, with a large number of edges. As a result, it is often not straightforward to understand the generated network. This is partly because the network layout is static, i.e., the locations of nodes remain the same regardless of which subnetwork is chosen. If the layout were optimized for each subnetwork, it would be much easier for users to understand the network architecture. Given this, I recommend that the authors consider updating the network layout interactively when a subnetwork is selected.

    6. When a subnetwork is chosen, the "Information Pane" appears. In this pane, it might be helpful for users if the authors provide some quick help link for each network score, e.g., how to interpret PageRank scores, etc.

    7. In the "Information Pane", a long list of SNPs is provided for "Associated SNPs", but it is not easy to use this list. I recommend that the authors make it downloadable as a table so that users can do downstream analysis. In addition, it would be much more convenient if selecting a SNP ID took the user to the relevant database, e.g., dbSNP. In this way, users could easily check where the SNP is located in terms of chromosome, gene, exon/intron/promoter/intergenic region, etc. Alternatively, the authors could consider using a quick information table (SNP ID, gene name, exon/intron/promoter/intergenic) instead of simply providing a list.
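Comment 4 above, on marginal counts, can be illustrated by comparing a raw shared-SNP count with a normalized weight. The Jaccard index used below is only one possible normalization and is not claimed to be what NETMAGE uses; the function name and the SNP sets are made up.

```python
def jaccard_weight(snps_a, snps_b):
    """Shared SNPs divided by all SNPs associated with either phenotype."""
    union = snps_a | snps_b
    if not union:
        return 0.0
    return len(snps_a & snps_b) / len(union)


# Two edges with the same raw weight (2 shared SNPs) but different margins.
few_a = {"rs1", "rs2", "rs3"}
few_b = {"rs1", "rs2", "rs4"}
many_a = {f"rs{i}" for i in range(1, 101)}   # 100 associated SNPs
many_b = {"rs1", "rs2", "rs200"}

raw_few = len(few_a & few_b)
raw_many = len(many_a & many_b)
norm_few = jaccard_weight(few_a, few_b)      # 2 / 4
norm_many = jaccard_weight(many_a, many_b)   # 2 / 101
```

Both pairs share two SNPs, yet the normalized weight is much lower when one phenotype has a large number of associated SNPs, which is the distinction the comment asks about.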

  3. Reviewer 1: Sarah Gagliano Taliun

    Sriram et al. introduce an open-source web-based tool NETMAGE to produce interactive disease-disease network (DDN) visualizations of biobank-level phenome-wide association summary statistics. The concept is interesting and relevant, but my major concern is regarding the interpretability of the DDN for researchers and clinicians to draw insights intuitively.

    Comments on the manuscript:

    Generally well written, with a logical flow. There are some minor errors (e.g. "an SNP" rather than "a SNP"), and some headers could be improved for readability (e.g. "Testing" is vague; this section really only touches upon run time).

    Figure 1- Displaying a single Manhattan plot for "PheWAS Summary Statistics" is not very intuitive. It makes me think of a single GWAS rather than a phenome-wide set of GWAS run on a Biobank. Perhaps revise the image.

    Is the disease-disease network only applicable to case/control studies? Could there be an extension to quantitative traits, and if so, would that be pertinent for discoveries?

    The authors refer to "SNPs" throughout to define genetic variation. If the summary statistics contain another type of variation (e.g. indels), are those associations still used? If so, I would suggest using a more generic term to define the genetic variation.

    The discussion seems underdeveloped. Discussion of limitations rather than only future work would be helpful.

    Case study-- The authors could improve the interpretability/discussion of the UKB PheWAS example. This is one of my largest concerns because the authors state that the tool can help researchers and clinicians get insight into the underlying genetic architecture of disease complications; however, the case study part of the manuscript is quite technical and could be challenging to interpret for someone without network experience; e.g. Table 2.

    Additionally, more details should be provided on the underlying summary statistics used (e.g. some details can be found on the About page of the HRC-imputed UKB PheWeb page: https://pheweb.org/UKB-SAIGE/about).

    The authors list additional filtering that they performed on the summary statistics, but it appears that some details are missing. For instance, how many traits remain after the case count filtering is applied? Also, what is used as a reference for the LD-pruning in PLINK?

    Run time-- I am wondering why Table 3 (run time for subsets of the UKBB data) ends at 1000 phenotypes. It would be interesting to see the run time for a size close to the case example (e.g. possibly adding a column for the total number of phenotypes used in the UKBB DDN). Additionally, this section gives the impression that run time depends only on the number of phenotypes. I would assume that run time also depends on the number of variants that were tested.

    Comments on the online tool:

    It is nice that on each page the authors have allowed users to download a pdf of the image and also the data behind the image (e.g. edge-map, node-map, etc.). The zoom-in feature for the visualization is also useful, as is the short video tutorial.

    I think that the search bar would be more user-friendly if suggestions automatically came up when the user begins to type. Additionally, displaying the list of "associated SNPs" in a (sortable and/or searchable) table (with some annotations, such as chr, position, closest gene, consequence, rather than just rsID) could be a neater and more informative way to show these data, rather than simply as it appears currently as a list in the "information pane".
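The type-ahead suggestion requested above can be sketched as a prefix lookup over a sorted list of phenotype names. This is only an illustration of the idea, written in Python for consistency with the other sketches rather than in the web tool's own front-end code; the `suggest` function and the phenotype names are hypothetical.

```python
from bisect import bisect_left


def suggest(prefix, sorted_names, limit=5):
    """Return up to `limit` names starting with `prefix`.

    sorted_names must be lowercase and sorted, so that all names sharing a
    prefix are contiguous and can be located by binary search.
    """
    prefix = prefix.lower()
    start = bisect_left(sorted_names, prefix)
    matches = []
    for name in sorted_names[start:start + limit]:
        if not name.startswith(prefix):
            break
        matches.append(name)
    return matches


# Made-up phenotype names for illustration.
names = sorted(
    n.lower()
    for n in ["Asthma", "Atrial fibrillation", "Atrial flutter",
              "Obesity", "Type 2 diabetes"]
)
hits = suggest("Atr", names)
```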

    My comment on interpretability for researchers and clinicians comes up again: I am not sure how useful/interpretable some of the search categories are for users to intuitively draw insights; for instance, number of triangles, PageRank, etc. I think the authors should really focus on the intuitiveness for the target audience so that the tool can have more impact.