Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods


Abstract

Advanced deep-learning methods, such as transformer-based foundation models, promise to learn representations of biology that can be employed to predict in silico the outcome of unseen experiments, such as the effect of genetic perturbations on the transcriptomes of human cells. To see whether current models already reach this goal, we benchmarked two state-of-the-art foundation models and one popular graph-based deep learning framework against deliberately simplistic linear models in two important use cases: for combinatorial perturbations of two genes for which only data for the individual single perturbations have been seen, we find that a simple additive model outperformed the deep learning-based approaches; and for perturbations of genes that have not yet been seen, but which may be "interpolated" from biological similarity or network context, a simple linear model performed as well as the deep learning-based approaches. While the promise of deep neural networks for the representation of biological systems and prediction of experimental outcomes is plausible, our work highlights the need for critical benchmarking to direct research efforts that aim to bring transfer learning to biology.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/14019384.

    In this preprint (v4), Ahlmann-Eltze, Huber, and Anders investigate whether sophisticated nonlinear ML models ("foundation models") pre-trained on single-cell RNA sequencing (scRNA-seq) data can predict the effect of gene expression perturbations (e.g. CRISPR knockdown) on transcript levels. Such models were fine-tuned to the application and then compared to far simpler, linear null models. The null models for predicting unseen double-perturbation effects included (i) predicting no effect (i.e., the perturbation does not affect expression), and (ii) an additive model in which the effect of a double perturbation is the sum of the effects of the two single perturbations. For unseen single perturbations, the authors developed a PCA-based linear regression that predicts perturbation effects from the correlation structure of the training data. They find that these null models generally outperform the fine-tuned non-linear models for predicting both single and double perturbation effects. These results are summarized by the title: "Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods."
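    To make the comparison concrete, here is a minimal, self-contained sketch of the null models as we understand them. All names, dimensions, the synthetic data, and the use of ridge regression are our own illustrative choices, not necessarily the preprint's implementation.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n_genes = 500

    # Hypothetical pseudobulk effect profiles: delta[p][g] is the mean change in
    # expression of gene g under single perturbation p, relative to control.
    delta = {"geneA": rng.normal(size=n_genes), "geneB": rng.normal(size=n_genes)}

    # Null model (i): "no effect" -- predict zero change for any perturbation.
    def predict_no_effect():
        return np.zeros(n_genes)

    # Null model (ii): additive -- the effect of a double perturbation is the
    # sum of the two measured single-perturbation effects.
    def predict_additive(pert_a, pert_b):
        return delta[pert_a] + delta[pert_b]

    # PCA-based linear model for unseen single perturbations (our reading):
    # embed each perturbable gene by its PCA coordinates in the training
    # expression matrix, then regress effect profiles on those embeddings.
    train_expression = rng.normal(size=(n_genes, 200))      # genes x cells
    gene_embedding = PCA(n_components=10).fit_transform(train_expression)

    train_idx = np.arange(50)                      # genes perturbed in training
    train_effects = rng.normal(size=(50, n_genes)) # their measured effect profiles
    reg = Ridge(alpha=1.0).fit(gene_embedding[train_idx], train_effects)

    unseen_idx = np.array([60, 61])                # held-out perturbations
    predicted_effects = reg.predict(gene_embedding[unseen_idx])
    ```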

    We have several high-level concerns about this work.

    (I) The authors use different linear models for predicting single and double perturbation effects on mRNA levels. For example, the additive null model, the only substantive null model considered for double perturbations, can only be applied to double perturbations, and only when both single perturbations have been measured separately. For this reason, the authors devise a distinct, PCA-based model for predicting single perturbation effects. This model can also predict the effects of double perturbations, but the performance of such predictions is not reported. Reporting it would enable a far more apt comparison, as the various non-linear models considered can be applied to both single and double perturbations (see the sketch below).
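    For illustration, the PCA-based sketch above extends to double perturbations in an obvious way, e.g. by combining the two gene embeddings before applying the fitted regression. The pairing rule here (summing embeddings) is our construction, not necessarily the preprint's; the point is only that such a comparison is readily available.

    ```python
    # Reusing gene_embedding and reg from the sketch above; summing the two
    # embeddings is our illustrative choice, not the preprint's method.
    idx_a, idx_b = 60, 61
    pair_feature = gene_embedding[idx_a] + gene_embedding[idx_b]
    predicted_double_effect = reg.predict(pair_feature.reshape(1, -1))
    ```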

    (II) The fine-tuning procedure applied to scFoundation, scGPT, and Gears is not adequately described; we could not replicate it if we tried. We have two specific concerns here. (a) The linear models operate on pseudobulked data, i.e. per-condition averages of scRNA-seq levels. The methods section does not make clear whether the ML models such as scFoundation were used to predict average effects or per-cell effects, and some plots comparing ML models to the null models appear to show predictions for every cell rather than for every condition (e.g. Fig 1C). If the deep learning models are indeed trained in such a fashion, then identical inputs (perturbations) are implicitly expected to produce different outputs (expression levels), which is impossible for non-probabilistic neural networks, and training on this kind of data can lead to poor performance (see the sketch below). (b) Relatedly, does the single-cell data have any value over bulk data here? None of the models is tested for its capacity to predict, for example, the variance in RNA counts across cells in the same condition. Perhaps training on scRNA-seq data is expected to enable models to learn the correlations needed to predict perturbation effects? If so, this point should be made explicitly.
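    To spell out point (a), here is a toy example of pseudobulking and of the mismatch that per-cell targets create for a deterministic model; the data are invented for illustration.

    ```python
    import pandas as pd

    # Toy single-cell table: one row per cell, labelled by its perturbation.
    cells = pd.DataFrame({
        "perturbation": ["geneA"] * 3 + ["geneB"] * 2,
        "gene1": [0.1, 0.3, 0.2, 1.0, 1.2],
        "gene2": [2.0, 1.8, 2.2, 0.5, 0.7],
    })

    # Pseudobulking: average expression per condition; this is what the
    # linear null models are fit to and evaluated on.
    pseudobulk = cells.groupby("perturbation").mean()

    # A per-cell target, by contrast, asks a deterministic network to map
    # the same input (the perturbation label) to different outputs
    # (individual cells). The best it can do is predict the conditional
    # mean -- i.e. the pseudobulk -- while the cell-to-cell variance below
    # remains as irreducible training loss.
    per_cell_variance = cells.groupby("perturbation").var()
    ```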

    (III) Very little of the fine-tuning procedure is described. Casual statements like "we limited the fine-tuning time to three days" are insufficient to understand or reproduce the work. How was this criterion chosen? What learning rate was used, and which learning rates were experimented with (using a held-out validation set)? How were batches constructed? Was an early stopping criterion applied? Without such crucial details, it is difficult to evaluate the work. Indeed, it is entirely possible that the reported results are due to overfitting of the non-linear models.
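    As a concrete target for the level of detail that would make the procedure reproducible, a methods section or repository could pin down something like the following specification. Every value here is a hypothetical placeholder, not a claim about what the authors did.

    ```python
    # All values below are hypothetical, shown only to indicate the
    # specification we would need in order to reproduce the fine-tuning.
    finetune_spec = {
        "optimizer": "AdamW",
        "learning_rate": 1e-4,                   # plus the grid actually searched
        "lr_grid_searched": [1e-3, 1e-4, 1e-5],  # on a held-out validation set
        "batch_size": 64,
        "batch_construction": "cells shuffled each epoch, stratified by perturbation",
        "max_epochs": 50,
        "early_stopping": {"metric": "validation MSE", "patience": 5},
        "wall_clock_budget_h": 72,               # the stated three-day limit
        "random_seed": 0,
    }
    ```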

    (IV) Does the pre-training of the foundation models make them appropriate to the tasks at hand? The authors should describe how the foundation models (and Gears) were trained and whether the effects of genetic perturbations were included in the pre-training data. If the pre-training data included genetic perturbations or some other relevant source of information, then it seems important to demonstrate that fine-tuning improves model predictions of unseen perturbations (i.e. generalization).

    Competing interests

    The authors declare that they have no competing interests.