Training Strategy Optimization to Mitigate Shortcut Learning in Pan-Cancer Drug Response Prediction

Abstract

Background

Predicting in vivo drug response is a central challenge in precision medicine, but labeled clinical data remain scarce, so models must still be trained on large-scale cancer cell line resources. Domain adaptation methods, which transfer knowledge from a source domain (cell lines) to a target domain (patients) by aligning feature distributions across domains, are a promising way to bridge the gap between in vitro models and patients. However, we observed that these methods can show a substantial discrepancy between pan-cancer evaluation metrics and cancer-type-specific prediction accuracy. This performance gap warrants a detailed investigation of their underlying predictive characteristics.
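The discrepancy described above can be made concrete by computing the same ranking metric once on the pooled samples and once within each cancer type. The following is a minimal sketch, not the evaluation code of the study: the function names, the group labels "heme" and "solid", and the toy scores are all invented for illustration, and AUROC is assumed as the metric.

```python
from collections import defaultdict


def auroc(labels, scores):
    """Probability that a random responder (label 1) outscores a random
    non-responder (label 0); ties count as 0.5. Returns None if a class is missing."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auroc_by_group(labels, scores, groups):
    """Compute AUROC separately within each group (e.g. each cancer type)."""
    buckets = defaultdict(lambda: ([], []))
    for y, s, g in zip(labels, scores, groups):
        buckets[g][0].append(y)
        buckets[g][1].append(s)
    return {g: auroc(ys, ss) for g, (ys, ss) in buckets.items()}


# Toy data (hypothetical): the score tracks tissue type, not response.
# "heme" samples are mostly responders and all receive high scores;
# "solid" samples are mostly non-responders and all receive low scores.
groups = ["heme"] * 6 + ["solid"] * 6
labels = [1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.80, 0.82, 0.84, 0.86, 0.88, 0.90,
          0.10, 0.12, 0.14, 0.16, 0.18, 0.20]

pooled = auroc(labels, scores)
per_type = auroc_by_group(labels, scores, groups)
print(f"pooled AUROC: {pooled:.3f}")
for g, value in per_type.items():
    print(f"{g} AUROC: {value:.3f}")
```

In this toy case the pooled value is well above chance while the within-type values are below chance, because the score only separates the two tissue groups. Reporting both views side by side is one way to surface that pattern.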

Results

We discovered that cancer-type-specific class imbalances in training data can lead domain adaptation models to engage in shortcut learning, where they primarily discriminate between cancer types rather than capturing the actual biological determinants of drug sensitivity. To address this, we propose a strategy that combines two approaches: (1) excluding the cancer types that cause the imbalance from the training data, and (2) retaining those cancer types while adjusting class balance through oversampling and class weighting. Among all configurations tested with the CODE-AE (Context-aware Deconfounding AutoEncoder) framework, moderate oversampling (30% non-responder ratio) combined with class weighting achieved the best performance, significantly improving prediction accuracy in 5 of 11 external patient cohorts from TCGA and GEO.
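The class balance adjustment can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: it assumes non-responders (label 0) are the class being oversampled, that oversampling means random duplication until that class reaches the target fraction, and that class weights follow the common inverse-frequency heuristic n_samples / (n_classes * class_count). The function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def oversample_to_ratio(samples, labels, minority_label, target_ratio, seed=0):
    """Randomly duplicate minority samples until they make up at least
    `target_ratio` of the data set. Existing samples are never removed."""
    if not 0 < target_ratio < 1:
        raise ValueError("target_ratio must be strictly between 0 and 1")
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority:
        raise ValueError("no samples carry the minority label")
    n_major = len(labels) - len(minority)
    # Smallest minority count m with m / (m + n_major) >= target_ratio.
    needed = math.ceil(target_ratio * n_major / (1 - target_ratio))
    extra = max(0, needed - len(minority))
    rng = random.Random(seed)
    picks = [rng.choice(minority) for _ in range(extra)]
    index = list(range(len(labels))) + picks
    return [samples[i] for i in index], [labels[i] for i in index]


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}


# Toy data (hypothetical): 90 responders (1) and 10 non-responders (0).
samples = [f"cell_line_{i}" for i in range(100)]
labels = [1] * 90 + [0] * 10

new_samples, new_labels = oversample_to_ratio(samples, labels, 0, 0.30)
ratio = new_labels.count(0) / len(new_labels)
weights = balanced_class_weights(new_labels)
print(f"non-responder fraction after oversampling: {ratio:.3f}")
print("class weights:", weights)
```

Stopping at a moderate target leaves some imbalance in place, and the class weights then raise the loss contribution of the still-rarer class. Whether the adjustment is applied across the whole training set or within each cancer type is not specified here, so the sketch treats the data as a single pool.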

Conclusions

Our findings demonstrate that appropriate class imbalance correction, rather than wholesale exclusion of imbalanced cancer subtypes, enables effective use of biologically relevant information shared across cancer types for drug response prediction. This study highlights the importance of jointly optimizing training data composition and class balance adjustment when developing robust pan-cancer drug response prediction models for precision medicine applications.

Highlights

  • Identified a critical discrepancy in current domain adaptation models for drug response prediction: high pan-cancer accuracy often masks poor performance within specific cancer types.

  • Revealed the root cause as “shortcut learning,” where models tend to distinguish between cancer tissue types (hematological vs. solid) rather than learning individual drug sensitivity.

  • Discovered severe class imbalance in training data, with hematological cell lines being disproportionately drug-responsive across multiple chemotherapeutics.

  • Proposed an architecture-agnostic fix using the CODE-AE framework: moderate oversampling (30% minority ratio) combined with class weighting.

  • Demonstrated significant improvements in 5 of 11 external patient cohorts, showing that correcting class bias is more effective than simply excluding problematic data.
