Oncogene and Tumor Suppressor Gene Classification Using Protein Language Model Embeddings and a Novel Optimization Strategy
Abstract
Background: Recent advances in protein language models (PLMs) and deep representation learning have enabled the creation of highly informative embeddings of protein sequences. Accurate classification of oncogenes and tumor suppressor genes (TSGs) from these embeddings has the potential to reveal biologically meaningful patterns relevant to cancer biology.

Methods: We compiled a balanced dataset of 304 cancer-related human proteins, including both oncogenes and TSGs, mapped to UniProt IDs. Protein sequence embeddings were generated using ESM, and dimensionality reduction was performed with Principal Component Analysis (PCA). A novel neural network training pipeline was implemented using our custom optimizer, PCAGroupAdam. Its performance was benchmarked against traditional optimizers (Adam, SGD, RMSprop) and a classical Random Forest classifier. Model evaluation employed cross-validation, ROC-AUC analysis, confusion matrices, and advanced explainable AI (SHAP) techniques. Statistical comparisons between models were performed using paired t-tests.

Results: The PCAGroupAdam-based neural network outperformed all baselines, achieving an accuracy of 0.66 and an ROC-AUC of 0.70 on the full dataset. SHAP analysis revealed that discriminative information was distributed across multiple embedding dimensions rather than concentrated in a single feature. Feature importance from both neural and tree-based models provided convergent insights. Statistical tests confirmed that PCAGroupAdam yielded significantly better performance than standard optimizers (p<0.05).

Conclusions: Our results demonstrate that combining protein embeddings with a novel group-based optimizer provides improved classification of oncogenes and TSGs compared to standard approaches. The methodology is robust, reproducible, and extensible to larger protein datasets. This approach may contribute to a better understanding of cancer-related protein function and highlights the potential of explainable deep learning in computational oncology.
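The evaluation described in the abstract (per-fold ROC-AUC followed by a paired t-test between models) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: the function names `roc_auc` and `paired_t_statistic` are hypothetical, and the per-fold values are made-up placeholders rather than results from the paper. The sketch compares the t statistic against a tabulated critical value instead of computing an exact p-value.

```python
import math
import statistics


def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    # Count pairwise wins of positives over negatives; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def paired_t_statistic(a, b):
    """Return the paired t statistic and its degrees of freedom for samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("differences are constant; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Hypothetical per-fold ROC-AUC values, for illustration only (not from the paper).
auc_model_a = [0.72, 0.68, 0.71, 0.69, 0.70]
auc_model_b = [0.66, 0.65, 0.69, 0.64, 0.66]

t_stat, dof = paired_t_statistic(auc_model_a, auc_model_b)

# Two-sided critical value of Student's t for alpha = 0.05 with 4 degrees of freedom.
T_CRIT_DF4 = 2.776
significant = abs(t_stat) > T_CRIT_DF4
print(f"t = {t_stat:.3f}, df = {dof}, significant at 0.05: {significant}")
```

Pairing by fold matters here because both models are scored on the same held-out splits, so fold-to-fold difficulty cancels in the differences. Cross-validation folds share training data, so the test's independence assumption holds only approximately and the result is best read as indicative.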