Assessment of Gene Set Enrichment Analysis using curated RNA-seq-based benchmarks

Abstract

Pathway enrichment analysis is a ubiquitous computational biology method to interpret a list of genes (typically derived from the association of large-scale omics data with phenotypes of interest) in terms of higher-level, predefined gene sets that share biological function, chromosomal location, or other common features. Among many tools developed so far, Gene Set Enrichment Analysis (GSEA) stands out as one of the pioneering and most widely used methods. Although originally developed for microarray data, GSEA is nowadays extensively utilized for RNA-seq data analysis. Here, we quantitatively assessed the performance of a variety of GSEA modalities and provide guidance in the practical use of GSEA in RNA-seq experiments. We leveraged harmonized RNA-seq datasets available from The Cancer Genome Atlas (TCGA) in combination with large, curated pathway collections from the Molecular Signatures Database to obtain cancer-type-specific target pathway lists across multiple cancer types. We carried out a detailed analysis of GSEA performance using both gene-set and phenotype permutations combined with four different choices for the Kolmogorov-Smirnov enrichment statistic. Based on our benchmarks, we conclude that the classic/unweighted gene-set permutation approach offered comparable or better sensitivity-vs-specificity tradeoffs across cancer types compared with other, more complex and computationally intensive permutation methods. Finally, we analyzed other large cohorts for thyroid cancer and hepatocellular carcinoma. We utilized a new consensus metric, the Enrichment Evidence Score (EES), which showed a remarkable agreement between pathways identified in TCGA and those from other sources, despite differences in cancer etiology. This finding suggests an EES-based strategy to identify a core set of pathways that may be complemented by an expanded set of pathways for downstream exploratory analysis. 
This work fills the existing gap in current guidelines and benchmarks for the use of GSEA with RNA-seq data and provides a framework to enable detailed benchmarking of other RNA-seq-based pathway analysis tools.
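
To make the "classic/unweighted gene-set permutation" modality named in the abstract concrete, the following is a minimal Python sketch, not the GSEA implementation itself. It assumes a list of genes already ranked by some differential-expression metric, computes the unweighted Kolmogorov-Smirnov-style running-sum enrichment score, and builds a null distribution from random gene sets of the same size. The function names, the two-sided comparison on absolute values, and the add-one p-value correction are illustrative assumptions; the GSEA software uses sign-specific null distributions and additional normalization steps that are omitted here.

```python
import random


def enrichment_score(ranked_genes, gene_set):
    """Unweighted running-sum enrichment score (signed maximum deviation from zero)."""
    hits = [gene in gene_set for gene in ranked_genes]
    n_hits = sum(hits)
    n_miss = len(ranked_genes) - n_hits
    if n_hits == 0 or n_miss == 0:
        raise ValueError("gene set must contain some, but not all, ranked genes")
    running, best = 0.0, 0.0
    for is_hit in hits:
        # Step up by 1/n_hits on a hit, down by 1/n_miss on a miss.
        running += 1.0 / n_hits if is_hit else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def gene_set_permutation_pvalue(ranked_genes, gene_set, n_perm=1000, seed=0):
    """Empirical p-value from random gene sets of matching size (simplified, two-sided)."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_genes, gene_set)
    size = sum(gene in gene_set for gene in ranked_genes)
    extreme = 0
    for _ in range(n_perm):
        random_set = set(rng.sample(ranked_genes, size))
        if abs(enrichment_score(ranked_genes, random_set)) >= abs(observed):
            extreme += 1
    # Add-one correction (an assumption of this sketch) avoids reporting p = 0.
    return observed, (extreme + 1) / (n_perm + 1)
```

Calling `gene_set_permutation_pvalue(ranked, {"g1", "g2", "g3"})` on a ranked list in which those three genes sit at the top returns a positive enrichment score with a small empirical p-value. Phenotype permutation would instead shuffle sample labels and re-rank the genes on every iteration, which is why it is the more computationally intensive option.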

Article activity feed

  1. (such as edgeR [48], DESeq [49], limma [33], and voom [50])

    Can you be more specific here? Which functions in these packages accomplish the task (e.g., I believe it is vst() for DESeq2)? Can you also cite DESeq2 instead of DESeq? I think DESeq has been retired.

  2. Fig 2

    Would you consider switching this plot to an UpSet plot (R packages UpSetR or ComplexUpset) instead of a Venn diagram? For many intersections, UpSet plots are a bit easier to understand than Venn diagrams.

  3. Importantly, it must be pointed out that gsea-3.0.jar, utilized in protocols published by Reimand et al [37], is affected by serious security vulnerabilities due to the use of the Java-based logging utility Apache Log4j in GSEA versions earlier than 4.2.3. Moreover, as reported by the GSEA Team, version 3.0 contained microarray-specific code (mostly related to Affymetrix) that may cause issues with RNA-seq data analysis, which was removed in later GSEA updates.

    Did you do anything to account for these things in your analysis?

  4. An important challenge of pathway enrichment analysis is that of gene set overlap, where some genes participate in multiple gene sets [35, 36].

    I'm so glad you included this! I have struggled with this a lot in my own research, so it is great to see it explicitly mentioned here.

  5. Taken together, S2 Fig and S3 Fig show that differential expression/enrichment analyses derived from these different count normalization and filtering procedures lead to highly concordant results at both gene and pathway levels.

    This is very nice.

  6. positive-control pathways

    Can you prepend this with "cancer-type-specific" so that it's clear inline what this means, without having to jump ahead to a later section?

  7. harmonized

    Can you provide a little more context as to what this means? Are all samples consistently analyzed or is there some normalization that takes place as well?

  8. GSEA was run using the latest available version 4.3.2 (build 13, October 2022) [24].

    This sentence switches from active to passive voice, and the next sentence is active again ("We"). Would it be possible to make this one active as well? As written, it sounds as though you did not run the GSEA analysis yourselves but obtained the results from somewhere else, which is a little confusing.
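
As a small illustration of the gene set overlap raised in comment 4, the sketch below quantifies how much pairs of gene sets share using the Jaccard index (intersection size divided by union size). The function names and the toy gene sets are hypothetical and are not taken from the article; the Jaccard index is just one common way to measure overlap.

```python
from itertools import combinations


def jaccard(set_a, set_b):
    """Jaccard index of two gene sets: |A & B| / |A | B| (0.0 if both are empty)."""
    set_a, set_b = set(set_a), set(set_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


def pairwise_overlap(gene_sets):
    """Map each unordered pair of gene-set names to its Jaccard index."""
    return {
        (name_a, name_b): jaccard(gene_sets[name_a], gene_sets[name_b])
        for name_a, name_b in combinations(sorted(gene_sets), 2)
    }


# Toy gene sets with made-up membership, for illustration only.
toy_sets = {
    "PATHWAY_A": {"TP53", "MYC", "EGFR"},
    "PATHWAY_B": {"MYC", "EGFR", "KRAS"},
    "PATHWAY_C": {"BRCA1"},
}
overlaps = pairwise_overlap(toy_sets)
```

Pairs with a high index tend to be enriched together, which is worth keeping in mind when counting how many distinct pathways an analysis has recovered.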