Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods


Abstract

Advanced deep-learning methods, such as transformer-based foundation models, promise to learn representations of biology that can be employed to predict in silico the outcome of unseen experiments, such as the effect of genetic perturbations on the transcriptomes of human cells. To see whether current models already reach this goal, we benchmarked two state-of-the-art foundation models and one popular graph-based deep learning framework against deliberately simplistic linear models in two important use cases: for combinatorial perturbations of two genes for which only data for the individual single perturbations have been seen, we find that a simple additive model outperformed the deep learning-based approaches; and for perturbations of genes that have not yet been seen, but which may be "interpolated" from biological similarity or network context, a simple linear model performed as well as the deep learning-based approaches. While the promise of deep neural networks for the representation of biological systems and prediction of experimental outcomes is plausible, our work highlights the need for critical benchmarking to direct research efforts that aim to bring transfer learning to biology.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/14019384.

    In this preprint (v4), Ahlmann-Eltze, Huber, and Anders investigate whether sophisticated nonlinear ML models ("foundation models") pre-trained on single-cell RNA sequencing (scRNA-seq) data can predict the effect of gene expression perturbations (e.g. CRISPR knockdown) on transcript levels. Such models were fine-tuned to the application and then compared to far simpler, linear null models. The null models for predicting unseen double-perturbation effects included (i) predicting no effect (i.e., the perturbation does not affect expression), and (ii) an additive model in which the effect of a double perturbation is the sum of the effects of the two single perturbations. For unseen single perturbations, the authors developed a PCA-based linear regression that predicts perturbation effects from the correlation structure of the training data. They find that these null models generally outperform the fine-tuned non-linear models for predicting both single and double perturbation effects. These results are summarized by the title: "Deep learning-based predictions of gene perturbation effects do not yet outperform simple linear methods."
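    To make the comparison concrete, here is a minimal, self-contained sketch of the null models as we understand them. All names, dimensions, the synthetic data, and the use of ridge regression are our own illustrative choices, not necessarily the preprint's implementation.

    ```python
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    n_genes = 500

    # Hypothetical pseudobulk effect profiles: delta[p][g] is the mean change in
    # expression of gene g under single perturbation p, relative to control.
    delta = {"geneA": rng.normal(size=n_genes), "geneB": rng.normal(size=n_genes)}

    # Null model (i): "no effect" -- predict zero change for any perturbation.
    def predict_no_effect():
        return np.zeros(n_genes)

    # Null model (ii): additive -- the effect of a double perturbation is the
    # sum of the two measured single-perturbation effects.
    def predict_additive(pert_a, pert_b):
        return delta[pert_a] + delta[pert_b]

    # PCA-based linear model for unseen single perturbations (our reading):
    # embed each perturbable gene by its PCA coordinates in the training
    # expression matrix, then regress effect profiles on those embeddings.
    train_expression = rng.normal(size=(n_genes, 200))      # genes x cells
    gene_embedding = PCA(n_components=10).fit_transform(train_expression)

    train_idx = np.arange(50)                      # genes perturbed in training
    train_effects = rng.normal(size=(50, n_genes)) # their measured effect profiles
    reg = Ridge(alpha=1.0).fit(gene_embedding[train_idx], train_effects)

    unseen_idx = np.array([60, 61])                # held-out perturbations
    predicted_effects = reg.predict(gene_embedding[unseen_idx])
    ```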

    We have several high-level concerns about this work.

    (I) The authors use different linear models for predicting single and double perturbation effects on mRNA levels. For example, the additive null model, the only substantive null model considered for double perturbations, can only be applied to double perturbations, and only when both single perturbations have been measured separately. For this reason, the authors devise a distinct, PCA-based model for predicting single perturbation effects. This model can also predict the effects of double perturbations, but the performance of such predictions is not reported. Reporting it would enable a far more apt comparison, as the various non-linear models considered can be applied to both single and double perturbations (see the sketch below).
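    For illustration, the PCA-based sketch above extends to double perturbations in an obvious way, e.g. by combining the two gene embeddings before applying the fitted regression. The pairing rule here (summing embeddings) is our construction, not necessarily the preprint's; the point is only that such a comparison is readily available.

    ```python
    # Reusing gene_embedding and reg from the sketch above; summing the two
    # embeddings is our illustrative choice, not the preprint's method.
    idx_a, idx_b = 60, 61
    pair_feature = gene_embedding[idx_a] + gene_embedding[idx_b]
    predicted_double_effect = reg.predict(pair_feature.reshape(1, -1))
    ```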

    (II) The fine-tuning procedure applied to scFoundation, scGPT, and Gears is not adequately described; we could not replicate it if we tried. We have two specific concerns here. (a) The linear models operate on pseudobulked data, i.e. per-condition averages of scRNA-seq levels. The methods section does not make clear whether the ML models such as scFoundation were used to predict average effects or per-cell effects, and some plots comparing ML models to the null models appear to show predictions for every cell rather than for every condition (e.g. Fig 1C). If the deep learning models are indeed trained in such a fashion, then identical inputs (perturbations) are implicitly expected to produce different outputs (expression levels), which is impossible for non-probabilistic neural networks, and training on this kind of data can lead to poor performance (see the sketch below). (b) Relatedly, does the single-cell data have any value over bulk data here? None of the models is tested for its capacity to predict, for example, the variance in RNA counts across cells in the same condition. Perhaps training on scRNA-seq data is expected to enable models to learn the correlations needed to predict perturbation effects? If so, this point should be made explicitly.
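    To spell out point (a), here is a toy example of pseudobulking and of the mismatch that per-cell targets create for a deterministic model; the data are invented for illustration.

    ```python
    import pandas as pd

    # Toy single-cell table: one row per cell, labelled by its perturbation.
    cells = pd.DataFrame({
        "perturbation": ["geneA"] * 3 + ["geneB"] * 2,
        "gene1": [0.1, 0.3, 0.2, 1.0, 1.2],
        "gene2": [2.0, 1.8, 2.2, 0.5, 0.7],
    })

    # Pseudobulking: average expression per condition; this is what the
    # linear null models are fit to and evaluated on.
    pseudobulk = cells.groupby("perturbation").mean()

    # A per-cell target, by contrast, asks a deterministic network to map
    # the same input (the perturbation label) to different outputs
    # (individual cells). The best it can do is predict the conditional
    # mean -- i.e. the pseudobulk -- while the cell-to-cell variance below
    # remains as irreducible training loss.
    per_cell_variance = cells.groupby("perturbation").var()
    ```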

    (III) Very little of the fine-tuning procedure is described. Casual statements like "we limited the fine-tuning time to three days" are insufficient to understand or reproduce the work. How was this criterion chosen? What learning rate was used, and which learning rates were experimented with (using a held-out validation set)? How were batches constructed? Was an early stopping criterion applied? Without such crucial details, it is difficult to evaluate the work. Indeed, it is entirely possible that the reported results are due to overfitting of the non-linear models.
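    As a concrete target for the level of detail that would make the procedure reproducible, a methods section or repository could pin down something like the following specification. Every value here is a hypothetical placeholder, not a claim about what the authors did.

    ```python
    # All values below are hypothetical, shown only to indicate the
    # specification we would need in order to reproduce the fine-tuning.
    finetune_spec = {
        "optimizer": "AdamW",
        "learning_rate": 1e-4,                   # plus the grid actually searched
        "lr_grid_searched": [1e-3, 1e-4, 1e-5],  # on a held-out validation set
        "batch_size": 64,
        "batch_construction": "cells shuffled each epoch, stratified by perturbation",
        "max_epochs": 50,
        "early_stopping": {"metric": "validation MSE", "patience": 5},
        "wall_clock_budget_h": 72,               # the stated three-day limit
        "random_seed": 0,
    }
    ```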

    (IV) Does the pre-training of the foundation models make them appropriate to the tasks at hand? The authors should describe how the foundation models (and Gears) were trained and whether the effects of genetic perturbations were included in the pre-training data. If the pre-training data included genetic perturbations or some other relevant source of information, then it seems important to demonstrate that fine-tuning improves model predictions of unseen perturbations (i.e. generalization).

    Competing interests

    The authors declare that they have no competing interests.