Gene expression has more power for predicting in vitro cancer cell vulnerabilities than genomics

Curation statements for this article:
  • Curated by eLife


    Summary: The authors propose a new approach to the derivation of cancer signatures and compare the relative impact of gene expression data with respect to other variables, particularly SNVs and CNVs. The simplicity of the idea and of the technical approach, to the point of singling out a single gene's predictive value, is a positive aspect. There are also critical aspects that will require substantial revision, including the underlying influence of tissue-specific genes. Overall, the paper provides a good basis for the generation of specific hypotheses that can be followed by additional validation studies at the computational and/or experimental level.

This article has been reviewed by the following groups


Abstract

Achieving precision oncology requires accurate identification of targetable cancer vulnerabilities in patients. Generally, genomic features are regarded as the state-of-the-art method for stratifying patients for targeted therapies. In this work, we conduct the first rigorous comparison of DNA- and expression-based predictive models for viability across five datasets encompassing chemical and genetic perturbations. We find that expression consistently outperforms DNA for predicting vulnerabilities, including many currently stratified by canonical DNA markers. Contrary to their perception in the literature, the most accurate expression-based models depend on few features and are amenable to biological interpretation. This work points to the importance of exploring more comprehensive expression profiling in clinical settings.

Article activity feed

  1. Reviewer #3:

    In this manuscript, Dempster et al. analysed the predictability of cell viability from baseline genomics- and transcriptomics-based features. They did a comprehensive analysis across feature and perturbation types, which is a valuable contribution to the field. The main findings of the paper (gene expression-based features outperform genomics-based ones) are not necessarily new, but the authors also show the interpretability of gene expression-based features, which clearly helps to place these machine learning (ML) models into biological context. This is especially important for possible translatability, as small (low number of features), interpretable models are generally preferred over large, "black box" models.

    The study is very nicely constructed from both machine learning and cancer biology perspectives. My only major comments concern some potentially confounding factors related to tissue type and feature filtering.

    Major comments:

    1. A well-known phenomenon in the field is the tissue-type specificity of drug sensitivity, which is a major confounding factor in several ML-based studies. The authors, absolutely correctly, use tissue type as features in their models to overcome this problem. However, because RF models (individual trees) do not use all features at the same time, it is possible that some genomics-based models are not using information about tissue type, even if tissue type was selected among the 1,000 features. On the other hand, for gene expression-based models (given the tissue specificity of gene expression), tissue-type information is probably always available. This could (partially) cause the better performance of gene expression features. Could the authors do some additional controls (e.g. providing "multiple copies" of tissue-type features for genomics-based models) to overcome this potential confounding factor?
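    One way to run the suggested control is to tile the tissue-type one-hot block before fitting, so that each tree's per-split feature subsample is more likely to contain a tissue-type column. A minimal sketch on entirely hypothetical toy data (the dataset sizes, signal strengths, and the `fit_with_tissue_copies` helper are all illustrative, not taken from the paper):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)

    # Hypothetical toy data: 200 cell lines, 1,000 binary genomic
    # features, 5 one-hot tissue-type columns, and a viability score
    # that depends mostly on tissue type.
    n_lines, n_genomic, n_tissues = 200, 1000, 5
    genomic = rng.integers(0, 2, size=(n_lines, n_genomic)).astype(float)
    tissue = np.eye(n_tissues)[rng.integers(0, n_tissues, size=n_lines)]
    y = tissue @ rng.normal(size=n_tissues) + 0.1 * rng.normal(size=n_lines)

    def fit_with_tissue_copies(n_copies):
        # Tile the tissue one-hot block n_copies times so each node's
        # random feature subsample is more likely to include tissue type.
        X = np.hstack([genomic] + [tissue] * n_copies)
        model = RandomForestRegressor(
            n_estimators=100, max_features="sqrt", random_state=0
        ).fit(X, y)
        # Sum the importance over all copies of the tissue block.
        return model.feature_importances_[n_genomic:].sum()

    print(fit_with_tissue_copies(1), fit_with_tissue_copies(10))
    ```

    Comparing the summed importance of the tissue block with 1 versus 10 copies gives a rough check of whether tissue information was under-sampled in the genomics-based models.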

    2. The authors use a Pearson correlation filter (mainly) to decrease computational time. In Figure 4 (and also in Figure 2 - supplement 3) they show that, in the case of "combined" features, the feature sets including gene expression-based features had the best performance. When did they apply the Pearson filter in the case of combined features: before or after combining them? I.e. in the case of expression + mutation, did they select the top 1,000 expression and top 1,000 mutation features, combine them, and train RF models with 2,000 features; or combine expression and mutation features, select the top 1,000 features, and train the models with those? If the latter, it would be important to see how much of each feature class (e.g. mutation and expression in my example) is included in the top 1,000 features. This is especially important because Pearson correlation as a filter is probably more suitable for continuous (expression) than binary (mutation) features, so it is possible that the combined feature sets use mostly expression-based features. In that case, it is not so surprising that the performance of combined-feature models is closer to that of expression-based models.
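    The difference between the two filtering orderings can be sketched on synthetic data (all dimensions and effect sizes here are hypothetical, and `top_k_by_pearson` is an illustrative helper, not the authors' code):

    ```python
    import numpy as np

    rng = np.random.default_rng(1)
    n = 500

    # Hypothetical features: continuous "expression" and binary "mutation"
    # columns, each weakly related to a viability phenotype y.
    y = rng.normal(size=n)
    expr = 0.3 * y[:, None] + rng.normal(size=(n, 2000))
    mut = (rng.random(size=(n, 2000)) < 0.05 + 0.02 * (y[:, None] > 0)).astype(float)

    def top_k_by_pearson(X, y, k):
        # |Pearson r| of each column with y, computed vectorised.
        Xc = X - X.mean(axis=0)
        yc = y - y.mean()
        r = (Xc * yc[:, None]).sum(axis=0) / (
            np.sqrt((Xc ** 2).sum(axis=0)) * np.sqrt((yc ** 2).sum()) + 1e-12
        )
        return np.argsort(-np.abs(r))[:k]

    # Ordering A: filter within each class, then combine (1,000 + 1,000).
    top_expr = top_k_by_pearson(expr, y, 1000)
    top_mut = top_k_by_pearson(mut, y, 1000)
    combined_A = np.hstack([expr[:, top_expr], mut[:, top_mut]])  # balanced by design

    # Ordering B: combine first, then take the joint top 1,000.
    joint = np.hstack([expr, mut])
    top_joint = top_k_by_pearson(joint, y, 1000)
    n_expr_in_B = int((top_joint < expr.shape[1]).sum())
    print(combined_A.shape, n_expr_in_B)
    ```

    In this toy setup the continuous columns carry larger correlations, so ordering B selects almost exclusively expression features, which is exactly the imbalance the comment warns about.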

  2. Reviewer #2:

    Summary and comments:

    This study presents an analysis of five large datasets of cancer cell viability, including both genetic and chemical perturbations, and finds that RNA-seq expression outperforms DNA-derived features in predicting cell viability response in nearly all cases, with the best results typically driven by a small number of interpretable expression features. The authors suggest that both existing and new cancer targets are frequently better identified using RNA-seq gene expression than any combination of other cancer cell properties.

    Overall, none of the main conclusions in the paper are surprising, which raises the question of whether sequencing more cancer exomes is really meaningful. This is a question that deserves serious debate, as major resources are being diverted to large-scale exome sequencing projects with low information content returns.

    The paper is well written, and at first glance the results seem to support the provocative title. Improved clarity earlier in the paper about what the predictor and response variables are would improve readability, particularly around what a "genomic" variable is. Most of the helpful details are buried in the methods.

    The main benefits of the manuscript are: (1) the emphasis on simple (i.e. few-feature) predictors that are themselves easily interpretable; and (2) the choice of random forest classifiers, which also makes interpretation of the predictions relatively straightforward. One concern is whether the breadth or depth (i.e. completeness) of the genomic predictor variables unfairly biases the findings against the ability to predict with those variables compared to expression variables, which are quite easy to encode and interpret. This concern is alluded to in the discussion when reviewing the findings of previous, related publications, and could be further explored. For instance, while mutations in RAS (H, K or N) or B-RAF were the only dependencies noted to be predicted better by genomics (i.e. mutations), are all driver mutations known and represented in the data? One would expect that amplified EGFR or HER2 would be predicted well from genomics, but these are notably missing, presumably because they do not meet the filtering criteria.

    A notable finding was that a single gene's expression data produced notably better results than gene set enrichment scores overall, despite having many more presumably irrelevant features. Predictive models for many vulnerabilities exhibit relationships expected to be specific to a single gene's expression (e.g. loss of a paralog's expression predicts dependency on its partner). There is no biological validation for any of the predictions in the manuscript.

    Specific comments:

    What was the thought process for choosing 100 perturbations in each dataset to label as SSVs? Why not 82, or 105? Was there a systematic analysis done to pick this number (e.g. harmonic mean)?

    Did the authors estimate the effect size they are measuring across the 100 selected SSVs? In other words, was there an estimation of fitness effect, single mutant fitness or degree of essentiality for the 100 SSVs and what range of effects are they exploring? One possible way to measure the fitness effect of each of the 100 vulnerabilities is to examine the dropout rate in pooled screens at the guide RNA level, looking for consistency in gRNA behaviour.
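    The gRNA-level consistency check suggested here could look roughly like the following (the log-fold-change values and gene names are invented purely for illustration):

    ```python
    import numpy as np

    rng = np.random.default_rng(2)

    # Hypothetical log-fold-changes for 4 gRNAs per gene in a pooled screen.
    lfc = {"GENE_A": rng.normal(-2.0, 0.3, size=4),   # strong, consistent dropout
           "GENE_B": rng.normal(-0.2, 0.8, size=4)}   # weak, noisy dropout

    for gene, vals in lfc.items():
        effect = vals.mean()            # effect size: mean gRNA dropout
        consistency = vals.std(ddof=1)  # low std across gRNAs -> consistent behaviour
        print(gene, round(effect, 2), round(consistency, 2))
    ```

    Reporting both the mean dropout and the spread across gRNAs for each of the 100 SSVs would make the range of fitness effects being explored explicit.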

    Did the authors include essential and non-essential genes as reference points in their analyses? This wasn't clear from the methods.

    The authors describe a clear gradation of response to either TP53 or MDM2 knockout according to the magnitude of EDA2R expression observed in multiple datasets (i.e. Achilles, Project Score, RNAi, GDSC17, PRISM). Using EDA2R expression to infer TP53 activity could have clinical benefit and deserves more attention (i.e. validation).

  3. Reviewer #1:

    General assessment:

    As precision oncology grows in importance, understanding the relationship between cancer type-specific vulnerabilities and their biomarkers is a major challenge of personalized therapy. Previously, genomic signatures such as mutations and copy number variations were favored for predicting cancer vulnerabilities. Dempster et al. present a systematic comparison of predictions with or without gene expression features using five major screen datasets, suggesting that gene expression better predicts cancer vulnerabilities. Although the interpretable models suggested in the last part of the paper are questionable, the main message and the supporting comparisons are clear.

    Major comments:

    1. RNA expression cannot be separated from cell lineage bias. For example, the ESR1 gene is also relatively overexpressed in normal female tissues. I wonder how overexpression-specific dependency can be separated from this tissue bias.

    2. Predicting drug response by expression signature might be risky if there is no clear copy number amplification signature or reasonable causality. Is it possible to find causal features explaining why a gene is overexpressed?

    3. In this paper, the authors present EDA2R expression as the top feature for predicting TP53 dependency and MDM2 inhibitor response, as an example of interpretable models. However, many studies have confirmed that the MDM2 phenotype depends upon TP53 genomic status. Similarly, the response to MDM2 inhibitors can be explained by TP53 mutational status. I am curious whether prediction of MDM2 dependency using EDA2R expression status is statistically better than prediction using TP53 mutational status.
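    The statistical comparison requested here amounts to cross-validating a model on the continuous expression feature versus the binary mutation feature and comparing scores. A sketch on simulated data (the generative model, variable names, and effect sizes below are assumptions for illustration only, not the paper's data):

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(3)
    n = 300

    # Hypothetical setup: a continuous latent "pathway activity" drives both
    # an EDA2R-like expression readout and MDM2 dependency; TP53 mutation is
    # modelled as a coarse binary readout of the same activity.
    activity = rng.normal(size=n)
    dependency = activity + 0.5 * rng.normal(size=n)
    expr = activity + 0.3 * rng.normal(size=n)   # continuous surrogate
    mut = (activity < -0.5).astype(float)        # binary surrogate

    def cv_r2(x):
        # 5-fold cross-validated R^2 of a single-feature random forest.
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        return cross_val_score(model, x.reshape(-1, 1), dependency, cv=5).mean()

    print(cv_r2(expr), cv_r2(mut))
    ```

    Under these assumptions the continuous surrogate retains more of the latent signal than the thresholded binary one, which is one plausible reason an expression feature could outperform mutation status; whether that holds for EDA2R versus TP53 would need to be tested on the actual screen data.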
