Predictive Modeling of Novel Somatic Mutation Impacts on Cancer Prognosis: A Machine Learning Approach Using the COSMIC Database

Abstract

Background: Somatic mutations play a crucial role in cancer initiation, progression, and treatment response. While high-throughput sequencing has vastly expanded our understanding of cancer genomics, interpreting the functional impact of novel somatic mutations remains challenging. Machine learning approaches show promise in predicting mutation impacts, but robust models for accurate prognosis across different cancer types are still needed.

Objective: This study aimed to develop and validate a machine learning model using the Catalogue of Somatic Mutations in Cancer (COSMIC) database to predict the functional impact of novel somatic mutations on cancer prognosis across various cancer types.

Methods: We extracted data on 6,573,214 coding point mutations across 1,391 cancer types from COSMIC v95. We engineered 47 features for each mutation, including sequence context, protein domain information, evolutionary conservation scores, and frequency data. We developed and compared Random Forest, XGBoost, and Deep Neural Network models, selecting XGBoost based on performance. The model was evaluated using standard metrics and externally validated using data from The Cancer Genome Atlas (TCGA).

Results: The XGBoost model achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.89 on the test set and 0.86 on the TCGA validation set. The model demonstrated consistent performance across major cancer types (AUC-ROC range: 0.85-0.92). Key predictive features included evolutionary conservation score, protein domain disruption, and mutation frequency. The model correctly identified 87% of known driver mutations and predicted 3,241 potentially high-impact novel mutations. Model predictions significantly correlated with patient survival in the TCGA dataset (HR = 1.8, 95% CI: 1.6-2.0, p < 0.001).

Conclusions: Our machine learning model shows strong predictive power in assessing the functional impact of somatic mutations on cancer prognosis across various cancer types. This approach has potential applications in research prioritization and clinical decision support, contributing to the advancement of precision oncology.

Keywords: cancer genomics; somatic mutations; machine learning; prognosis prediction; COSMIC database; precision oncology
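The AUC-ROC values reported above can be read as the probability that a randomly chosen high-impact mutation receives a higher model score than a randomly chosen low-impact one. The following is a minimal sketch of that calculation using the rank-based (Mann-Whitney) formulation. The function name `auc_roc` and the toy labels and scores are hypothetical illustrations; they are not the study's code or data, and the abstract does not state how the authors computed the metric.

```python
def auc_roc(labels, scores):
    """Rank-based AUC-ROC for binary labels (1 = positive, 0 = negative).

    Equals the fraction of positive/negative pairs in which the positive
    has the higher score, with tied scores counted as half a pair.
    """
    pairs = sorted(zip(scores, labels))  # ascending by score
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs at least one positive and one negative")

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied items occupy ranks i+1 .. j; each gets the average rank.
        avg_rank = (i + 1 + j) / 2.0
        rank_sum_pos += avg_rank * sum(1 for k in range(i, j) if pairs[k][1] == 1)
        i = j

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    u = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


# Hypothetical toy data: 1 = high-impact mutation, 0 = low-impact mutation.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = auc_roc(toy_labels, toy_scores)
print(toy_auc)
```

On this toy input, three of the four positive/negative pairs are ordered correctly, so the function returns 0.75. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation, which gives a sense of scale for the reported 0.86 to 0.89.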
