Oncogene and Tumor Suppressor Gene Classification Using Protein Language Model Embeddings and a Novel Optimization Strategy

Ahmet Emir Şaşmazlar

Abstract

Background: Recent advances in protein language models (PLMs) and deep representation learning have enabled the creation of highly informative embeddings of protein sequences. Accurate classification of oncogenes and tumor suppressor genes (TSGs) from these embeddings has the potential to reveal biologically meaningful patterns relevant to cancer biology.

Methods: We compiled a balanced dataset of 304 cancer-related human proteins, including both oncogenes and TSGs, mapped to UniProt IDs. Protein sequence embeddings were generated using ESM, and dimensionality reduction was performed with Principal Component Analysis (PCA). A novel neural network training pipeline was implemented using our custom optimizer, PCAGroupAdam. Its performance was benchmarked against traditional optimizers (Adam, SGD, RMSprop) and a classical Random Forest classifier. Model evaluation employed cross-validation, ROC-AUC analysis, confusion matrices, and advanced explainable AI (SHAP) techniques. Statistical comparisons between models were performed using paired t-tests.

Results: The PCAGroupAdam-based neural network outperformed all baselines, achieving an accuracy of 0.66 and an ROC-AUC of 0.70 on the full dataset. SHAP analysis revealed that discriminative information was distributed across multiple embedding dimensions rather than concentrated in a single feature. Feature importance from both neural and tree-based models provided convergent insights. Statistical tests confirmed that PCAGroupAdam yielded significantly better performance than standard optimizers (p<0.05).

Conclusions: Our results demonstrate that combining protein embeddings with a novel group-based optimizer provides improved classification of oncogenes and TSGs compared to standard approaches. The methodology is robust, reproducible, and extensible to larger protein datasets. This approach may contribute to a better understanding of cancer-related protein function and highlights the potential of explainable deep learning in computational oncology.
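
The abstract names ROC-AUC and paired t-tests as its evaluation tools but does not show how they are computed. The following is a minimal sketch of both quantities using only the Python standard library. It is not the author's code: the function names `roc_auc` and `paired_t_statistic` are hypothetical, and all labels, scores, and per-fold values are invented toy numbers unrelated to the results reported above.

```python
import math
from statistics import mean, stdev


def roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs in which the
    positive example receives the higher score; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for two matched samples,
    e.g. per-fold scores of two models evaluated on the same folds."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two samples of equal length >= 2")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Toy classifier output (invented): 1 = oncogene, 0 = TSG is an assumed encoding.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)

# Toy per-fold ROC-AUC values for two hypothetical models on the same 5 folds.
model_a = [0.81, 0.79, 0.84, 0.80, 0.82]
model_b = [0.78, 0.77, 0.80, 0.79, 0.78]
t, df = paired_t_statistic(model_a, model_b)

# For df = 4, the two-sided 5% critical value of Student's t is about 2.776.
significant = abs(t) > 2.776

print(f"ROC-AUC = {auc:.3f}")
print(f"paired t = {t:.3f}, df = {df}, significant at 0.05: {significant}")
```

The pairwise definition of ROC-AUC is quadratic in the number of examples, which is acceptable for a dataset of a few hundred proteins. In practice a library routine (for example a rank-based implementation) would normally be used, and a p-value would be read from the t distribution rather than compared against a fixed critical value.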

Version published to 10.21203/rs.3.rs-9066725/v1 on Research Square
Mar 11, 2026

Related articles

GL-E2EATP: improving protein-ATP binding residue prediction using global and local embedding of protein language model

This article has 7 authors:
1. Bing Rao
2. Jie Bai
3. Maha A. Thafar
4. Somayah Albaradei
5. Kamran Arshad
6. Apilak Worachartcheewanh
7. Muhammad Arif
This article has no evaluations. Latest version: Mar 26, 2026

Protein Function Prediction with Pretrained ProtT5 Embeddings and Gradient Boosting

This article has 2 authors:
1. Jett Appel
2. Nathan Butcher
This article has no evaluations. Latest version: Apr 28, 2026

Pathway-based machine learning for breast cancer risk stratification: an interpretable framework validated in two independent cohorts

This article has 2 authors:
1. Suhaan Thayyil
2. Eshaan Nidee
This article has no evaluations. Latest version: Apr 8, 2026
