Subcellular Localization Constrains Protein Detectability and Reveals Systematic RNA-Protein Discordance Across Cancers

Abstract

Transcript abundance is widely used as a proxy for protein expression in cancer studies; however, mRNA levels often fail to predict protein detectability due to post-transcriptional and compartment-specific regulatory processes. Here, we present a machine learning framework that integrates RNA expression, gene-level attributes, and subcellular localization to model protein detectability across human cancers.

Leveraging transcriptomic data from TCGA, TARGET, and GTEx, and protein annotations from the Human Protein Atlas, we constructed a dataset comprising over 100,000 gene–cancer pairs across seven tumor types. Models based on RNA features alone achieved moderate predictive performance (ROC-AUC ~0.71), whereas incorporating subcellular localization significantly improved accuracy (ROC-AUC ~0.82). Paired bootstrap analysis confirmed that these gains were statistically robust.
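A paired bootstrap comparison of two models' ROC-AUC can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the number of resamples, and the percentile interval are assumptions. The key point is that both models are scored on the same resampled gene–cancer pairs in each iteration, so the distribution of the AUC difference reflects the pairing.

```python
import random

def roc_auc(labels, scores):
    """Rank-based ROC-AUC (Mann-Whitney U); tied scores get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs at least one positive and one negative")
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def paired_bootstrap_auc_diff(labels, scores_a, scores_b, n_boot=1000, seed=0):
    """Bootstrap AUC(b) - AUC(a), resampling the same rows for both models.

    Returns (mean difference, 2.5th percentile, 97.5th percentile).
    n_boot and the percentile interval are illustrative choices.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if sum(y) in (0, n):
            continue  # skip resamples that contain only one class
        a = [scores_a[i] for i in idx]
        b = [scores_b[i] for i in idx]
        diffs.append(roc_auc(y, b) - roc_auc(y, a))
    diffs.sort()
    lower = diffs[int(0.025 * n_boot)]
    upper = diffs[int(0.975 * n_boot) - 1]
    return sum(diffs) / n_boot, lower, upper
```

Under these assumptions, an interval for the difference that excludes zero would be consistent with the gain being robust to resampling. Here `scores_a` would stand for the RNA-only model's predicted probabilities and `scores_b` for the model that adds subcellular localization.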

We further identify a substantial set of genes with high transcript abundance yet absent protein detection, revealing widespread RNA-protein decoupling. These discordant genes are enriched in mitochondrial, metabolic, and translational regulatory pathways, suggesting that discordance reflects structured biological processes rather than stochastic variation. Together, our results demonstrate that cellular context, particularly subcellular localization, is a key determinant of protein detectability and underscore the limitations of transcript-centric interpretations in cancer genomics.

Article activity feed

  1. This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/19416669.

    Major issues

    • Reliability of protein detectability labels due to IHC/antibody limitations

    • Lack of orthogonal validation (e.g., Western blot / proteomics)

    • Possibly unclear cancer-specific generalization in the figures

    Minor issues

    • Clarity of feature importance visualization

    • Limited discussion of evaluation metrics beyond ROC curves
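    To illustrate the last point, two metrics that are often reported alongside ROC curves for imbalanced binary labels are average precision (a summary of the precision-recall curve) and the Matthews correlation coefficient. The sketch below is illustrative only: the function names are hypothetical, and neither the preprint nor the review specifies which additional metrics should be used.

```python
import math

def average_precision(labels, scores):
    """Mean of the precision values at each rank where a positive is found.

    Tied scores are ordered by input position here, which is a simplification.
    """
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("average precision needs at least one positive label")
    return total / hits

def matthews_corrcoef(labels, preds):
    """Matthews correlation coefficient for binary labels and hard predictions."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Convention: return 0.0 when any marginal count is zero
    return (tp * tn - fp * fn) / denom if denom else 0.0
```

    Unlike ROC-AUC, average precision depends on the fraction of positive labels, so it can be informative when detected and undetected proteins are unevenly represented.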

    Competing interests

    The author declares that they have no competing interests.

    Use of Artificial Intelligence (AI)

    The author declares that they used generative AI to come up with new ideas for their review.