Systematic interrogation of mutation groupings reveals divergent downstream expression programs within key cancer genes

Abstract

Genes implicated in tumorigenesis often exhibit diverse sets of genomic variants in the tumor cohorts within which they are frequently mutated. We sought to identify the downstream expression effects of these perturbations and to determine whether this heterogeneity at the genomic level is reflected in a corresponding heterogeneity at the transcriptomic level. Applying a novel hierarchical framework for organizing the mutations present in a cohort, along with machine learning pipelines trained on sample expression profiles, we systematically interrogated the signatures associated with combinations of perturbations recurrent in cancer. This allowed us to catalogue the mutations with discernible downstream expression effects across a number of tumor cohorts, as well as to uncover and characterize a multitude of cases where subsets of a gene's mutations are clearly divergent in their function from the remaining mutations of the gene.

Article activity feed

  1. ### Reviewer #3

    This work presents a method to analyze integrated mutation and transcript data to identify mutations in individual genes that drive similar and divergent transcriptional signatures. Overall the work appears novel and provides potential insights that could generate hypotheses worthy of further study. The work is limited in that confirmation is done only for a set of mutations on GATA3 with existing drug sensitivity cell line data. It would be helpful to have an indication that more than a single result from the large study provides validated insights.

    Concerns:

    1. While the approach is nicely detailed, one critical aspect remains unclear. An AUC is generated for each prediction of mutation from transcriptional signature based on cross-validation. I could not deduce from this statement exactly how this was done, given the introduction here of a mean score: "We measured a classifier's ability to identify a transcriptomic signature for its assigned task using the area under the receiver operating characteristic curve metric (AUC) calculated using samples' mean scores across ten iterations of four-fold cross-validation."

    2. The claim "These results are striking in that predicting the presence of a rarer type of mutation should, everything else being equal, be more difficult owing to decreased statistical power" is really applicable to a hypothesis test, so it is not immediately obvious that it applies in the case of cross-validation generating an AUC.

    3. The claim that a Spearman correlation of AUCs between methods is a validation of robustness is difficult to accept. Note that if you uniformly subtracted 0.5 from every AUC, the result would give a Spearman correlation of 1 with the original data, but it would not be a very robust result. Why is Pearson correlation not used?

    4. It is clear that many classifiers were actually run, and it would be helpful to have the number actually summarized. This ties into the concern with only a single validation in drug sensitivity data, since there may be false discoveries given a large number of classifiers.
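One plausible reading of the sentence quoted in concern 1 can be sketched as follows: each repetition of four-fold cross-validation gives every sample exactly one held-out score, the ten held-out scores per sample are averaged, and a single AUC is computed from those averages. This is only an illustration of that reading, not the authors' pipeline. The toy centroid classifier, the one-dimensional data, and all function names are assumptions made for the example.

```python
import random
from statistics import mean

def auc(labels, scores):
    """AUC as the fraction of (mutant, wild-type) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def centroid_scorer(train_x, train_y):
    """Toy stand-in classifier (assumption): signed distance from the class midpoint."""
    mut = mean(x for x, y in zip(train_x, train_y) if y)
    wt = mean(x for x, y in zip(train_x, train_y) if not y)
    mid = (mut + wt) / 2
    sign = 1 if mut >= wt else -1
    return lambda x: sign * (x - mid)

def mean_cv_scores(xs, ys, n_repeats=10, n_folds=4, seed=0):
    """Average each sample's held-out score over repeated k-fold cross-validation."""
    rng = random.Random(seed)
    held_out = [[] for _ in xs]
    idx = list(range(len(xs)))
    for _ in range(n_repeats):
        rng.shuffle(idx)  # a fresh random partition for every repetition
        for f in range(n_folds):
            test = idx[f::n_folds]
            test_set = set(test)
            train = [i for i in idx if i not in test_set]
            scorer = centroid_scorer([xs[i] for i in train], [ys[i] for i in train])
            for i in test:
                held_out[i].append(scorer(xs[i]))  # one held-out score per repetition
    return [mean(s) for s in held_out]

# Hypothetical one-feature "expression" data: 20 wild-type and 20 mutant samples
# whose value ranges overlap, so the task is learnable but not trivial.
xs = [0.1 * i for i in range(20)] + [1.0 + 0.1 * i for i in range(20)]
ys = [0] * 20 + [1] * 20

cv_auc = auc(ys, mean_cv_scores(xs, ys))
print(f"AUC from mean held-out scores: {cv_auc:.3f}")
```

Under this reading, every score entering the AUC comes from a model that did not see the sample during training, which is the property a reader would want the manuscript to state explicitly.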

  2. ### Reviewer #2

    In this study, the authors analyzed the association between types of somatic mutations and their downstream effects on the transcriptome using data obtained from large tumor data consortia such as METABRIC and TCGA. Subsequently, the authors systematically show functional relevance using CCLE data.

    Concerns:

    Using tumor profiling data from various consortia, several groups have shown these associations using different statistical methodologies (PMID: 21555372, PMID: 26436532, PMID: 27127206 and thereon). In that light, the results described in this study are correlational and some are obvious. It is not clearly described which transcriptional programs are impacted by mutation subgroups and how distinct they are from those of other tumor types with similar mutation subgroups. It is also not clear whether these distinct mutation subgroups carry any clinical significance, such as for outcomes. Furthermore, transcriptional programs are also regulated by DNA methylation, and its role in defining the transcriptional program under the influence of mutation subgroups is not described.

    Specific Concerns:

    1. What data normalization and batch correction methods were applied to the expression data from TCGA, METABRIC and other datasets?

    2. What clustering methods were applied for the subsequent UMAP projection?

    3. Although the association between mutation subgroups and expression is described, it is not clear whether an expression profile of a specific group of genes was found in the analysis. If so, the functional significance of those co-regulated genes is not described.

    4. Page 35 (lines 781-782): What is the biological and statistical rationale for removing neighborhood genes? There is a significant neighborhood effect in certain cancers such as ccRCC, where 3p is significant for tumorigenesis and progression.

    5. The statistical methods and the reasons for applying them to the data are not well described. Moreover, the methods are not described in a linear order from the start, which leads to confusion. The multiple-correction sections, although mentioned, are vague.

    6. Earlier studies have shown concordance between RNA-Seq and microarrays. In that context (page 16, lines 348-351), why do the authors assume that differences exist between these platforms?

    7. The manuscript is long and difficult to read, with emphasis on some obvious things. It could be shortened for easier reading.

  3. ### Reviewer #1

    While this is an important area, the organization and results presentation render this current form of the manuscript unacceptable. Some specific challenges are described below.

    1. Throughout the manuscript, the authors report AUC on the training set as the primary metric of assessment and to compare models between genes. However, these performance metrics are more valid for cross-validation and may be sensitive to the differences in sample size introduced by the number of mutations. The authors would be better served by using the permutation-based statistic they develop later in the results throughout to report results.

    2. The authors develop a permutation-based statistic to assess performance in a manner that controls for sample size. It is presented as part of the results, but most of its description is relegated to the supplemental methods. This is a critical part of the evaluation that should appear in the main manuscript and be used for all results presented in the manuscript. This is of particular importance for the comparison between TCGA and METABRIC performance, since these cohorts have different sample sizes.

    3. Several hypotheses about the function of specific mutations or mutational groupings are made throughout the manuscript based solely on the AUC prediction values. These appear speculative and could be better grounded in results by evaluating the function of the genes in the transcriptional programs that underlie the prediction (e.g., using feature importance scores to determine the specific genes associated with the classifier).

    4. It is unclear why specific genes are selected for presentation in the manuscript. These appear cherry-picked to describe well-performing genes and do not give a comprehensive presentation of the performance of the algorithm, particularly in the first two subsections of the results, "Subgrouping classifiers uncover alteration divergence in a breast cancer cohort" and "Subgrouping classifier output reveals the structure of downstream effects within cancer genes." The latter section in particular includes a substantial amount of biological description of function based solely on performance that is not grounded in the results presented.

    5. The definition of "subgroupings" is not clearly described. It is not possible to follow as written how the 7598 groupings are determined and how these are used in the machine learning framework. This needs to be significantly clarified.

    6. It is unclear why HER2 amplifications are a focus of analysis for Luminal A subtype breast cancer samples, which are by definition HER2-.

    7. An expanded presentation of the results of relative classification accuracy by gene and cancer type would be useful for evaluating the further impact of cancer type on performance and for determining the role of the biology in mediating mutations. In particular, it would be useful to evaluate whether cancers with different cell type composition (e.g., large fibroblast content in mesenchymal HPV- HNSCC tumors) impact the results of the classifier. A similar comparison would be useful between in vivo tumors and in vitro cancer types from the gene expression profiles in CCLE.

    8. The GitHub links for the software presented in this paper do not work.
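The kind of permutation-based assessment discussed in concerns 1 and 2 can be illustrated with a generic label-permutation test for an AUC. This is a minimal sketch of the general technique, not the statistic developed in the manuscript, whose details are in its supplemental methods. The function names and the toy scores are assumptions, and the sketch permutes labels against fixed scores rather than retraining a classifier for each permutation.

```python
import random

def auc(labels, scores):
    """AUC as the fraction of (mutant, wild-type) pairs ranked correctly; ties count 0.5."""
    pos = [s for l, s in zip(labels, scores) if l]
    neg = [s for l, s in zip(labels, scores) if not l]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p_value(labels, scores, n_perm=1000, seed=0):
    """Empirical p-value: how often shuffled labels reach the observed AUC.

    Shuffling keeps the number of mutant and wild-type samples fixed, so the
    null distribution reflects the class sizes of the task being tested.
    """
    rng = random.Random(seed)
    observed = auc(labels, scores)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if auc(shuffled, scores) >= observed:
            hits += 1
    # The +1 terms keep the estimate away from an impossible p-value of zero.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical classifier scores: 10 mutant samples all scored above 10 wild-type samples.
labels = [1] * 10 + [0] * 10
scores = [0.9 - 0.01 * i for i in range(10)] + [0.4 - 0.01 * i for i in range(10)]

observed, p = permutation_p_value(labels, scores)
print(f"observed AUC = {observed:.2f}, permutation p = {p:.4f}")
```

Because the null distribution is built from the same class sizes as the observed task, a given AUC obtained with very few mutated samples yields a weaker p-value than the same AUC obtained with many, which is the sample-size control the reviewer asks to see applied throughout.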

  4. ## Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 2 of the manuscript.

    ### Summary:

    The reviewers are in agreement that the authors present an innovative classifier framework to predict mutational status and subgroups based upon transcriptional profiles. They perform a comprehensive analysis across cancer subtypes to assess context-dependence of mutations and link these classifiers to cell line data to further predict therapeutic outcomes. Overall the work appears novel and provides potential insights that could generate hypotheses worthy of further study. While this is an important area, the work is limited in several ways. These include numerous issues with the statistical methods used, lack of clarity as to whether the results were significant, potential concern about cherry-picking results, and the need to consider alternative factors contributing to the reported relationships, coupled with weaknesses in the organization and presentation of the results.