Deep Learning-Based Genetic Perturbation Models Do Outperform Uninformative Baselines on Well-Calibrated Metrics


Abstract

Single-cell genetic perturbation modeling involves predicting the effects of unobserved genetic manipulations, enabling scalable in silico screens for target discovery. Recent reports have claimed that deep learning-based perturbation models fail to outperform uninformative baselines, raising doubts about their utility. Here, we show that these conclusions largely stem from limitations of benchmarking metrics, not from the models themselves. We introduce a framework for evaluating benchmark metric calibration using positive and negative controls, including a new positive control baseline (the interpolated duplicate) and a quantitative calibration measure (the dynamic range fraction). Across 14 perturbation datasets and 13 evaluation metrics, we find that conventional metrics such as mean squared error (MSE) and control-referenced delta correlation (Pearson(Δctrl)) are often poorly calibrated, whereas weighted and rank-based alternatives exhibit consistent calibration. Under well-calibrated metrics, deep learning models outperform mean, control, and linear baselines, and in some cases even surpass the additive baseline in combination-prediction tasks. Calibrated evaluation thus explains prior reports of model underperformance, revealing that deep learning models do outperform uninformative baselines.
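As a rough sketch of the two conventional metrics the abstract names, the snippet below computes MSE and the control-referenced delta correlation (Pearson(Δctrl)) between a predicted and an observed post-perturbation expression profile. The function names, the use of plain NumPy vectors for expression profiles, and the toy control baseline are illustrative assumptions, not the paper's implementation; the paper's own positive/negative-control framework and dynamic range fraction are not reproduced here.

```python
import numpy as np

def mse(pred, obs):
    """Mean squared error between predicted and observed expression vectors."""
    pred, obs = np.asarray(pred, dtype=float), np.asarray(obs, dtype=float)
    return float(np.mean((pred - obs) ** 2))

def pearson_delta_ctrl(pred, obs, ctrl):
    """Pearson correlation of perturbation effects (deltas) taken relative
    to the unperturbed control profile — i.e. Pearson(Δctrl)."""
    pred, obs, ctrl = (np.asarray(x, dtype=float) for x in (pred, obs, ctrl))
    d_pred, d_obs = pred - ctrl, obs - ctrl
    return float(np.corrcoef(d_pred, d_obs)[0, 1])

# Toy example (synthetic data, purely illustrative):
rng = np.random.default_rng(0)
ctrl = rng.normal(size=200)                      # unperturbed control profile
obs = ctrl + rng.normal(scale=0.5, size=200)     # observed perturbed profile
pred_ctrl_baseline = ctrl.copy()                 # "control" baseline: predict no change

print(mse(pred_ctrl_baseline, obs))
```

Note that the control baseline makes Pearson(Δctrl) undefined (its predicted delta is identically zero), which is one way an uninformative baseline can interact awkwardly with a metric — the kind of calibration issue the abstract is concerned with.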