Phantasus: web-application for visual and interactive gene expression analysis

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This study presents a useful tool called Phantasus, a web application to analyze gene expression data generated by microarray or RNA-seq technologies. The web application will help biologists, end users, and non-bioinformatics experts to analyze new data or replicate transcriptomic studies. Local use of Phantasus through its Bioconductor package reveals incomplete functionality with respect to current best practices in analyzing bulk RNA-seq data.

Abstract

Transcriptomic profiling has become a standard approach to quantify a cell state, which has led to the accumulation of a huge number of public gene expression datasets. However, both the reuse of these datasets and the analysis of newly generated ones require significant technical expertise. Here we present Phantasus, a user-friendly web application for interactive gene expression analysis which provides streamlined access to more than 84000 public gene expression datasets and allows analysis of user-uploaded datasets. Phantasus integrates an intuitive and highly interactive JavaScript-based heatmap interface with the ability to run sophisticated R-based analysis methods. Overall, Phantasus allows users to go all the way from loading, normalizing and filtering data to differential gene expression and downstream analysis. Phantasus can be accessed online at https://ctlab.itmo.ru/phantasus or https://artyomovlab.wustl.edu/phantasus or can be installed locally from Bioconductor (https://bioconductor.org/packages/phantasus). Phantasus source code is available at https://github.com/ctlab/phantasus under the MIT licence.

Article activity feed

  1. Author response:

    Reviewer #3 (Public Review):

    Software UX design is not a trivial task and a point-and-click interface may become difficult to use or misleading when such design is not very well crafted. While Phantasus is a laudable effort to bring some of the out-of-the-box transcriptomics workflows closer to the broader community of point-and-click users, there are a number of shortcomings that the authors may want to consider improving.

    Thank you for such an in-depth review. We really appreciate this feedback and have tried to address all of the concerns in the new version of Phantasus.

    Here I list the ones I found running Phantasus locally through the available Bioconductor package:

    (1) The feature of loading one of the thousands of available GEO datasets in one click is great. However, one important use of any such interface is the possibility for users to analyze their own data. One of the standard formats for storing tables of RNA-seq counts is CSV. However, if we try to upload from the computer a CSV file with expression data, such as the counts stored in the file GSE120660_PCamerge_hg38.csv.gz from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120660, a first problem is that the system does not recognize that the CSV file is compressed. A second problem is that it does not recognize that values are separated by commas, the very original CSV format, giving a cryptic error "columnVector is undefined". If we transform the CSV format into tab-separated values (TSV) format, then it works, but this already constitutes a first barrier for the target user of Phantasus.

    Thank you for highlighting this issue with file format support. We acknowledge how common CSV and CSV.gz files are in gene expression analysis. In response, we have updated our data loading procedure to support these file formats. Moreover, the most recent version of our web application recognizes gzip-compressed files in any of the supported table formats: GCT, TSV, CSV and XLSX.
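
    To illustrate the two problems the reviewer describes, the following is a minimal sketch in Python of content-based detection of gzip compression and of the field delimiter. It is not the Phantasus loader; the function name read_expression_table and the simple "tab if present, else comma" rule are assumptions made for illustration only.

```python
import csv
import gzip
import io

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def read_expression_table(raw: bytes):
    """Return (header, rows) from the raw bytes of a CSV/TSV file, gzipped or not.

    Hypothetical helper for illustration; not part of Phantasus.
    """
    # Detect compression from the content rather than the file extension.
    if raw[:2] == GZIP_MAGIC:
        raw = gzip.decompress(raw)
    text = raw.decode("utf-8")

    # Guess the delimiter from the header line: tab if present, otherwise comma.
    first_line = text.split("\n", 1)[0]
    delimiter = "\t" if "\t" in first_line else ","

    reader = csv.reader(io.StringIO(text, newline=""), delimiter=delimiter)
    rows = [row for row in reader if row]  # skip empty lines
    return rows[0], rows[1:]
```

    Calling read_expression_table on the bytes of a .csv, .csv.gz, .tsv or .tsv.gz file returns the same (header, rows) shape, so later steps need not know how the file was stored.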

    (2) Many RNA-seq processing pipelines use Ensembl annotations, which for the purpose of downstream interpretation of the analysis, need to be translated into HUGO gene symbols. When I try to annotate the rows to translate the Ensembl gene identifiers, I get the error

    "There is no AnnotationDB on server. Ask administrator to put AnnotationDB sqlite databases in cacheDir/annotationdb folder"

    Thank you for revealing this issue. Indeed, locally installed instances of Phantasus might lose some functionality in the absence of some auxiliary files. For example, gene annotation mapping is unavailable without annotation databases. Previously, the user had to perform additional setup steps to unlock a few features, which might be confusing and unclear. To overcome this, we have significantly revised the installation procedure. The newly added ‘setupPhantasus’ function creates all necessary configuration files and provides an interactive dialog that helps the user load all necessary data files from our official cache mirror (https://alserglab.wustl.edu/files/phantasus/minimal-cache/). Docker-based installation follows the same approach; however, it is configured to install everything by default. Thus, with the new installation procedure, a locally installed Phantasus now has the whole functionality available at the official mirrors. A comprehensive installation description is now available at https://ctlab.github.io/phantasus-doc/installation.

    (3) When trying to normalize the RNA-seq counts, there are no standard options such as within-library (RPKM, FPKM) or between-library (TMM) normalization procedures.

    Thank you for your feedback. We have expanded the available normalization options in the updated version of Phantasus. We added support for TMM normalization as suggested by the edgeR package and voom normalization from the limma package. However, certain strategies like RPKM/FPKM or TPM rely on gene-specific effective lengths, which are challenging to infer without protocol and alignment details. As Phantasus operates on gene expression matrices and doesn't execute alignment steps, the implementation of these normalizations seems infeasible. On the other hand, if the user has a matrix with FPKM or TPM gene values (for example from a core facility), such a matrix can be loaded into Phantasus and used for the analysis.
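
    The dependence on gene lengths can be made concrete with a small Python sketch. It contrasts counts per million (CPM), which needs only the count matrix, with TPM, which cannot be computed without a per-gene effective length. The function names and the toy numbers are illustrative assumptions and are not taken from Phantasus, edgeR or limma.

```python
def cpm(counts):
    """Counts per million for one sample: only the counts are required."""
    total = sum(counts)
    return [c * 1e6 / total for c in counts]


def tpm(counts, effective_lengths):
    """Transcripts per million for one sample.

    Requires a per-gene effective length (in bases), which a bare count
    matrix does not carry.
    """
    # Length-normalized rate per gene, then rescaled to sum to one million.
    rates = [c / length for c, length in zip(counts, effective_lengths)]
    total_rate = sum(rates)
    return [r * 1e6 / total_rate for r in rates]


# Toy example: three genes whose counts are proportional to their lengths.
counts = [10, 30, 60]
lengths = [1000, 3000, 6000]  # hypothetical effective lengths
print(cpm(counts))            # differs per gene
print(tpm(counts, lengths))   # equal for all three genes
```

    In the toy data the three genes have very different CPM values but identical TPM values, which shows how much the result depends on length information that is unavailable from the count matrix alone.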

    If I take log2(1+x) a new tab is created with the normalized data, but it's not easy to realize what happened because the tab has the same name as the previous one and, while the colors of the heatmap changed to reflect the new scale of the data, this is quite subtle. This may cause an inexperienced user to apply the same normalization step again on the normalized data. Ideally, the interface should lead the user through a pipeline, reducing unnecessary degrees of freedom associated with each step.

    Thank you for your comment. Indeed, our approach of creating a new tab for each alteration of the expression values while preserving the tab name can be a source of confusion for a user. On the other hand, generating informative tab names without overwhelming users with too much detail is also challenging. As a compromise, we provide an option for the user to manually rename the tab. Still, we agree that this remains an area for improvement. We also consider it to be part of a larger issue: for example, the loaded data can already be log-scaled, so that even one round of log-scale transformation in Phantasus would be incorrect. Accordingly, we are exploring ways to address this issue in the future by adding automated checks to the tools or, as you suggested, implementing stricter pipelines.
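
    One possible form of such an automated check is sketched below in Python. It is a deliberately crude heuristic under stated assumptions: the function names and the threshold of 50 are hypothetical, and a shallow count matrix with only small values would be misclassified. It is not a check that Phantasus is known to implement.

```python
import math


def looks_log_scaled(values, max_log_value=50.0):
    """Heuristic guess at whether expression values are already log-scaled.

    Assumption: log2-scale values stay small, whereas raw counts or
    intensities usually contain values far above the threshold.
    The threshold of 50 is an arbitrary illustrative choice.
    """
    finite = [v for v in values if math.isfinite(v)]
    if not finite:
        return False
    return max(finite) <= max_log_value


def log2_plus_one(values):
    """Apply the log2(1 + x) transformation to a list of non-negative values."""
    return [math.log2(1 + v) for v in values]


raw = [0, 15, 1200, 48000]
if not looks_log_scaled(raw):
    transformed = log2_plus_one(raw)
```

    A tool could run a check of this kind before offering log2(1+x) and warn the user when the data already appear to be on a log scale, which would also cover accidental repeated application.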

    (4) Phantasus allows one to filter out lowly-expressed genes by averaging expression of genes across samples and discarding/selecting genes using some cutoff value on that average. This strategy is fine, but to make an informed decision on that cutoff it would be useful to see a density plot of those averages that would allow one to identify the modes of low and high expression and decide the cutoff value that separates them.

    Thank you for the suggestion. Indeed a density plot might help users to make informed decisions during gene filtration. We have added such a plot into the ‘Plot/Chart’ tool as a ‘histogram’ chart type.
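
    The idea behind such a chart can be sketched in a few lines of Python: compute the mean expression of each gene across samples and bin the means, so that the low and high expression modes become visible. The function name and the equal-width binning are illustrative assumptions, not the implementation of the ‘Plot/Chart’ tool.

```python
def mean_expression_histogram(matrix, n_bins=10):
    """Bin per-gene mean expression into equal-width bins.

    matrix: list of rows, one row of expression values per gene.
    Returns (bin_edges, counts).
    """
    means = [sum(row) / len(row) for row in matrix]
    lo, hi = min(means), max(means)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width if all means are equal
    counts = [0] * n_bins
    for m in means:
        # Clamp so that the maximum value falls into the last bin.
        index = min(int((m - lo) / width), n_bins - 1)
        counts[index] += 1
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, counts


# Toy matrix: three lowly expressed genes and two highly expressed ones.
matrix = [[0, 0], [1, 1], [0.5, 1.5], [9, 11], [10, 10]]
edges, counts = mean_expression_histogram(matrix, n_bins=2)
```

    With the toy matrix the two bins separate the low and high groups, and a cutoff between them could be read off the bin edges.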

    It would be also nice to have an interface to the filterByExpr() function from the edgeR package, which provides more control on how to filter out lowly-expressed genes.

    Thank you for proposing the inclusion of an interface for the filterByExpr() function from the edgeR package. In the recent update we have incorporated filterByExpr() as part of the voom normalization tool. For now, for simplicity, we have decided to keep only the default parameter values. However, we will explore the addition of a dedicated filtering tool in the future.

    (5) When attempting a differential expression (DE) analysis, a popup window appears saying:

    "Your dataset is filtered. Limma will apply to unfiltered dataset. Consider using New Heat Map tool."

    One of the main purposes of filtering lowly-expressed genes is to conduct a DE analysis afterwards, so it does not make sense that the tool says that such an analysis will be done on the unfiltered dataset. The reference to the "New Heat Map tool" is vague, and it is unclear where the user should look for that other tool, without any further information or link.

    Thank you for highlighting this issue. We agree that the message in the popup window and the default action were confusing. In response to your feedback, we've updated the default behavior of our DE tools to automatically use the filtered data in a new tab. Additionally, we've clarified the warning message to ensure a better understanding of this process.

    (6) The DE analysis only allows for a two-sample group comparison, which is an important limitation on the questions we may want to address. The construction of more complex designs could be graphically aided by using the ExploreModelMatrix Bioconductor package (Soneson et al, F1000Research, 2020).

    Indeed, the ability to create complex designs and various comparisons is important for many applications for gene expression analysis. Accordingly, in the latest Phantasus version, we've introduced an advanced design feature for the DE analysis, enabling the utilization of multiple column annotations for the design matrix. Combined with the existing ability to create new annotations, this update facilitates the setup of diverse design matrices. While at the moment we do not allow setting a complex contrast, we hope that the current interface will cover most of the differential expression use cases.
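
    As an illustration of what combining multiple column annotations into a design matrix means, here is a minimal Python sketch of treatment coding with an intercept. The function name, the annotation names "treatment" and "batch", and the column naming are assumptions chosen for the example; the sketch does not reproduce the Phantasus or limma code.

```python
def design_matrix(annotations, factors):
    """Build a treatment-coded design matrix with an intercept.

    annotations: one dict per sample, mapping annotation name -> level.
    factors: annotation names to include, e.g. ["treatment", "batch"].
    Returns (column_names, rows), with one row per sample.
    """
    columns = ["(Intercept)"]
    encoders = []
    for factor in factors:
        levels = sorted({sample[factor] for sample in annotations})
        # The first level acts as the reference and gets no column of its own.
        for level in levels[1:]:
            columns.append(f"{factor}{level}")
            encoders.append((factor, level))
    rows = [
        [1] + [1 if sample[f] == lvl else 0 for f, lvl in encoders]
        for sample in annotations
    ]
    return columns, rows


samples = [
    {"treatment": "ctrl", "batch": "b1"},
    {"treatment": "drug", "batch": "b1"},
    {"treatment": "ctrl", "batch": "b2"},
    {"treatment": "drug", "batch": "b2"},
]
columns, rows = design_matrix(samples, ["treatment", "batch"])
```

    Here the "treatmentdrug" column carries the drug effect adjusted for batch, which is the kind of comparison that a plain two-group design cannot express.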

    (7) When trying to perform a pathway analysis with FGSEA, I get the following error:

    "Couldn't load FGSEA meta information. Please try again in a moment. Error: cannot open the connection In call: file(file, "rt")"

    We expect this issue to be resolved by the more streamlined setup process that we have implemented. Among other things, the new approach aims to eliminate the unexpected absence of metafiles in local installations. The latest Phantasus package version explicitly prompts the user to load the necessary additional files automatically during the initial run, reducing the opportunities for an invalid setup.

    Finally, there have already been some efforts to bring R and Bioconductor transcriptomics pipelines closer to point-and-click users, such as iSEE (Rue-Albrecht et al, 2018) and GeneTonic (Marini et al, 2021), but they are not compared or at least cited in the present work.

    Indeed, our comparison focused on tools that offer non-programmatic functionality for gene expression data analysis. While tools like iSEE and GeneTonic are adept at visualizing data and provide extensive capabilities, they require additional data preparation using R, which places them outside the specific scope of the tools we assessed.

    One nice feature of these two tools that I missed in Phantasus is the possibility of generating the R code that produces the analysis performed through the interface. This is important to provide a way to ensure the reproducibility of the analyses performed.

    The ability to generate R code within tools like these indeed aids in ensuring analysis reproducibility. We have previously attempted to implement this functionality in Phantasus; however, it proved hard to do in a useful fashion due to potentially complex interactions between the user and the client-side part of Phantasus. Nevertheless, we acknowledge the significance of such a feature and aim to introduce it in the future.

  2. eLife assessment

    This study presents a useful tool called Phantasus, a web application to analyze gene expression data generated by microarray or RNA-seq technologies. The web application will help biologists, end users, and non-bioinformatics experts to analyze new data or replicate transcriptomic studies. Local use of Phantasus through its Bioconductor package reveals incomplete functionality with respect to current best practices in analyzing bulk RNA-seq data.

  3. Reviewer #1 (Public Review):

    Maksim Kleverov et al. developed a tool called Phantasus, a web application for matrix visualization and analysis of gene expression data generated by either microarray or RNA-seq technologies. With Phantasus, users can load, normalize, and plot their own data or those available in public databases and investigate the samples to remove outliers before the differential expression analysis.

    Phantasus can be accessed online or can be installed locally from Bioconductor. One of the advantages of the web application is that it combines an interactive graphical user interface with access to various R-based analysis methods. For methods that rely on functions already available in existing R packages, only wrapper R functions are implemented. The tool was developed with a focus on being helpful to both expert and non-expert users in bioinformatic gene expression analysis.

  4. Reviewer #2 (Public Review):

    Kleverov et al. present Phantasus, a web application for interactive gene expression analysis. The tool allows the user to load microarray and RNA-Seq data from NCBI GEO. The user is able to explore, normalize, filter and perform differential expression analysis using limma or DESeq2 pipelines for microarray and RNA-Seq data, respectively. The web tool is capable of generating figures such as PCA and volcano plots and performing gene set enrichment analysis. Phantasus has some advantages when compared to the set of tools already available, showing a good trade-off between ease of use, access to data and different functions. Furthermore, the application is open source and the pre-processed cache files are provided by the authors. Thus, the more experienced user can install the tool on a local computer.

    Finally, Phantasus is limited to standardized analyses available in its internal methods and databases, which may not meet the needs of researchers who wish to apply different types of quantification and normalization. However, this is the ideal tool for the non-bioinformatics user who wants to reanalyze public data or perform simple differential expression analyses on their own data.

  5. Reviewer #3 (Public Review):

    Software UX design is not a trivial task and a point-and-click interface may become difficult to use or misleading when such design is not very well crafted. While Phantasus is a laudable effort to bring some of the out-of-the-box transcriptomics workflows closer to the broader community of point-and-click users, there are a number of shortcomings that the authors may want to consider improving. Here I list the ones I found running Phantasus locally through the available Bioconductor package:

    1. The feature of loading one of the thousands of available GEO datasets in one click is great. However, one important use of any such interface is the possibility for users to analyze their own data. One of the standard formats for storing tables of RNA-seq counts is CSV. However, if we try to upload from the computer a CSV file with expression data, such as the counts stored in the file GSE120660_PCamerge_hg38.csv.gz from https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120660, a first problem is that the system does not recognize that the CSV file is compressed. A second problem is that it does not recognize that values are separated by commas, the very original CSV format, giving a cryptic error "columnVector is undefined". If we transform the CSV format into tab-separated values (TSV) format, then it works, but this already constitutes a first barrier for the target user of Phantasus.

    2. Many RNA-seq processing pipelines use Ensembl annotations, which for the purpose of downstream interpretation of the analysis, need to be translated into HUGO gene symbols. When I try to annotate the rows to translate the Ensembl gene identifiers, I get the error

    "There is no AnnotationDB on server. Ask administrator to put AnnotationDB sqlite databases in cacheDir/annotationdb folder"

    3. When trying to normalize the RNA-seq counts, there are no standard options such as within-library (RPKM, FPKM) or between-library (TMM) normalization procedures. If I take log2(1+x) a new tab is created with the normalized data, but it's not easy to realize what happened because the tab has the same name as the previous one and, while the colors of the heatmap changed to reflect the new scale of the data, this is quite subtle. This may cause an inexperienced user to apply the same normalization step again on the normalized data. Ideally, the interface should lead the user through a pipeline, reducing unnecessary degrees of freedom associated with each step.

    4. Phantasus allows one to filter out lowly-expressed genes by averaging expression of genes across samples and discarding/selecting genes using some cutoff value on that average. This strategy is fine, but to make an informed decision on that cutoff it would be useful to see a density plot of those averages that would allow one to identify the modes of low and high expression and decide the cutoff value that separates them. It would be also nice to have an interface to the filterByExpr() function from the edgeR package, which provides more control on how to filter out lowly-expressed genes.

    5. When attempting a differential expression (DE) analysis, a popup window appears saying:

    "Your dataset is filtered. Limma will apply to unfiltered dataset. Consider using New Heat Map tool."

    One of the main purposes of filtering lowly-expressed genes is to conduct a DE analysis afterwards, so it does not make sense that the tool says that such an analysis will be done on the unfiltered dataset. The reference to the "New Heat Map tool" is vague, and it is unclear where the user should look for that other tool, without any further information or link.

    6. The DE analysis only allows for a two-sample group comparison, which is an important limitation on the questions we may want to address. The construction of more complex designs could be graphically aided by using the ExploreModelMatrix Bioconductor package (Soneson et al, F1000Research, 2020).

    7. When trying to perform a pathway analysis with FGSEA, I get the following error:

    "Couldn't load FGSEA meta information. Please try again in a moment. Error: cannot open the connection In call: file(file, "rt")"

    Finally, there have already been some efforts to bring R and Bioconductor transcriptomics pipelines closer to point-and-click users, such as iSEE (Rue-Albrecht et al, 2018) and GeneTonic (Marini et al, 2021), but they are not compared or at least cited in the present work. One nice feature of these two tools that I missed in Phantasus is the possibility of generating the R code that produces the analysis performed through the interface. This is important to provide a way to ensure the reproducibility of the analyses performed.