Using The Cancer Genome Atlas from cBioPortal to Develop Genomic Datasets for Machine Learning Assisted Cancer Treatment
Abstract
Predicting the impact of genetic mutations is crucial for understanding diseases like cancer. Polymorphism Phenotyping (PolyPhen) and Sorting Intolerant From Tolerant (SIFT) are key tools for assessing how amino acid substitutions affect protein function and mutation pathogenicity. To our knowledge, no ready-to-use genomic dataset exists for prediction models to identify potentially harmful mutations, which could support research and clinical decisions. This study develops genomic and non-genomic datasets using The Cancer Genome Atlas (TCGA) from cBioPortal and applies machine learning models to predict PolyPhen and SIFT scores. We explore three classification models: Random Forest (RF), Extreme Gradient Boosting (XGBoost), and an ensemble RF-XGBoost model. Experimental results show that genomic data yields more accurate predictions than non-genomic data. The ensemble RF-XGBoost model performs best on genomic data, achieving average accuracies of 88.43% for PolyPhen and 95.13% for SIFT, highlighting the potential of artificial intelligence in genetic mutation analysis for disease treatment.
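The abstract does not say how the RF and XGBoost models are combined in the ensemble. As a minimal sketch, assuming the two models are merged by soft voting (averaging their predicted class probabilities), the combination step could look like the following. The function names, the equal weighting, and the probability values are hypothetical illustrations, not data or code from the study.

```python
from typing import List, Sequence


def soft_vote(prob_a: Sequence[Sequence[float]],
              prob_b: Sequence[Sequence[float]],
              weight_a: float = 0.5) -> List[int]:
    """Blend two models' class-probability rows and return predicted class indices.

    Assumption: the ensemble is a weighted average of probabilities (soft voting).
    """
    if len(prob_a) != len(prob_b):
        raise ValueError("both models must score the same number of samples")
    predictions = []
    for row_a, row_b in zip(prob_a, prob_b):
        blended = [weight_a * pa + (1.0 - weight_a) * pb
                   for pa, pb in zip(row_a, row_b)]
        # Pick the class index with the highest blended probability.
        predictions.append(max(range(len(blended)), key=blended.__getitem__))
    return predictions


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of samples whose predicted class matches the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Made-up probabilities for three mutations over three hypothetical score classes.
rf_probs = [[0.7, 0.2, 0.1], [0.4, 0.5, 0.1], [0.2, 0.3, 0.5]]
xgb_probs = [[0.6, 0.3, 0.1], [0.6, 0.3, 0.1], [0.1, 0.2, 0.7]]
labels = [0, 0, 2]

ensemble_pred = soft_vote(rf_probs, xgb_probs)
print(ensemble_pred, accuracy(labels, ensemble_pred))
```

In this toy input the two models disagree on the second mutation, and the blended probabilities decide the final class. In practice the probability rows would come from fitted classifiers, for example through a `predict_proba` call on each model.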