Subcellular Localization Constrains Protein Detectability and Reveals Systematic RNA-Protein Discordance Across Cancers

Kedar Joshi
Saniya Kate


Abstract

Transcript abundance is widely used as a proxy for protein expression in cancer studies; however, mRNA levels often fail to predict protein detectability due to post-transcriptional and compartment-specific regulatory processes. Here, we present a machine learning framework that integrates RNA expression, gene-level attributes, and subcellular localization to model protein detectability across human cancers.

Leveraging transcriptomic data from TCGA, TARGET, and GTEx, and protein annotations from the Human Protein Atlas, we constructed a dataset comprising over 100,000 gene–cancer pairs across seven tumor types. Models based on RNA features alone achieved moderate predictive performance (ROC-AUC ~0.71), whereas incorporating subcellular localization significantly improved accuracy (ROC-AUC ~0.82). Paired bootstrap analysis confirmed that these gains were statistically robust.
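The paired bootstrap comparison mentioned above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the number of replicates, the percentile interval, and the synthetic scores are all assumptions made for illustration. The idea is to resample the same gene–cancer pairs for both models so that the difference in ROC-AUC is computed on identical rows in every replicate.

```python
import numpy as np

def roc_auc(y_true, scores):
    """ROC-AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    y_true = np.asarray(y_true, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    pos, neg = scores[y_true], scores[~y_true]
    diff = pos[:, None] - neg[None, :]  # all positive-negative score differences
    return float(((diff > 0).sum() + 0.5 * (diff == 0).sum()) / diff.size)

def paired_bootstrap_auc_gain(y_true, scores_a, scores_b, n_boot=1000, seed=0):
    """Return the mean of AUC(b) - AUC(a) over paired resamples and a 95% percentile interval."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=bool)
    scores_a = np.asarray(scores_a, dtype=float)
    scores_b = np.asarray(scores_b, dtype=float)
    n = len(y_true)
    deltas = []
    while len(deltas) < n_boot:
        idx = rng.integers(0, n, size=n)  # same row indices for both models (the "paired" part)
        y = y_true[idx]
        if y.all() or not y.any():
            continue  # AUC is undefined when a resample contains a single class
        deltas.append(roc_auc(y, scores_b[idx]) - roc_auc(y, scores_a[idx]))
    deltas = np.array(deltas)
    lo, hi = np.percentile(deltas, [2.5, 97.5])
    return float(deltas.mean()), (float(lo), float(hi))

# Synthetic illustration only: these scores are made up, not taken from the preprint.
rng = np.random.default_rng(42)
labels = rng.integers(0, 2, size=300).astype(bool)
rna_only = labels + rng.normal(0, 1.5, size=300)   # hypothetical weaker model
with_loc = labels + rng.normal(0, 0.8, size=300)   # hypothetical stronger model
mean_gain, ci = paired_bootstrap_auc_gain(labels, rna_only, with_loc, n_boot=200)
print(f"mean AUC gain: {mean_gain:.3f}, 95% interval: ({ci[0]:.3f}, {ci[1]:.3f})")
```

An interval that excludes zero would be one way to read "statistically robust"; the preprint's actual criterion is not stated in the abstract.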

We further identify a substantial set of genes with high transcript abundance yet absent protein detection, revealing widespread RNA-protein decoupling. These discordant genes are enriched in mitochondrial, metabolic, and translational regulatory pathways, suggesting that discordance reflects structured biological processes rather than stochastic variation. Together, our results demonstrate that cellular context, particularly subcellular localization, is a key determinant of protein detectability and underscore the limitations of transcript-centric interpretations in cancer genomics.
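As a rough illustration of what "high transcript abundance yet absent protein detection" could mean operationally, the sketch below flags entries whose RNA level is at or above an upper quantile while the protein label is negative. The function name `flag_discordant`, the 0.75 quantile cutoff, and the toy values are assumptions for this example; the abstract does not specify the thresholds the authors used.

```python
import numpy as np

def flag_discordant(rna, detected, quantile=0.75):
    """Mark entries with RNA at or above the given quantile and no protein detection."""
    rna = np.asarray(rna, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    cutoff = np.quantile(rna, quantile)  # assumed definition of "high" transcript abundance
    return (rna >= cutoff) & ~detected

# Toy values, not data from the preprint.
genes = ["G1", "G2", "G3", "G4", "G5"]
rna_levels = [1.0, 2.0, 3.0, 4.0, 100.0]
protein_detected = [True, True, False, True, False]
mask = flag_discordant(rna_levels, protein_detected)
discordant = [g for g, m in zip(genes, mask) if m]
print(discordant)
```

A quantile cutoff is only one possible choice; a fixed expression threshold or a per-cancer cutoff would be equally plausible readings of the abstract.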

PREreview
Apr 4, 2026
This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/19416669.

Major issues

Reliability of protein detectability labels due to IHC/antibody limitations
Lack of orthogonal validation (e.g., Western blot / proteomics)
Possibly unclear cancer-specific generalization in the figures

Minor issues

Clarity of feature importance visualization
Limited discussion of evaluation metrics beyond ROC curves

Competing interests

The author declares that they have no competing interests.

Use of Artificial Intelligence (AI)

The author declares that they used generative AI to come up with new ideas for their review.
Version published to 10.64898/2026.03.30.713919 on bioRxiv
Apr 1, 2026

Integrated transcriptomic and machine learning-driven analysis reveals high-confidence circular RNA biomarkers in Lung Adenocarcinoma

This article has 2 authors:
1. Ayushi Malviya
2. Rajabrata Bhuyan
This article has no evaluations. Latest version Feb 19, 2026
Large-scale proteome inference from unpaired single-cell transcriptomic and proteomic data by msInfer

This article has 9 authors:
1. Yadong Wang
2. Tianyi Zhao
3. Yuzhi Sun
4. Renjie Liu
5. Liyuan Zhang
6. Chengcheng Zhang
7. Yuran Jia
8. Liang Cheng
9. Guohua Wang
This article has no evaluations. Latest version Apr 2, 2026
Reference protein-coding transcripts of human genes annotated using long-read transcriptome datasets

This article has 2 authors:
1. Kuo-Feng Tung
2. Wen-chang Lin
This article has no evaluations. Latest version Mar 16, 2026
