Personalized Cancer Diagnosis Using Genetic Dataset and ML Models
Abstract
Accurate classification of genetic mutations is critical for precision medicine, but traditional models struggle with the complexity of unstructured textual data. This research addresses that challenge by transforming mutation descriptions into structured numerical representations, which improves both machine interpretability and model performance.

Our approach uses TF-IDF vectorization to represent the text and applies Singular Value Decomposition (SVD) for dimensionality reduction, retaining the essential information while reducing noise. To handle class imbalance, we employ SMOTE, which augments the training dataset with synthetic minority-class samples. We further introduce a multi-level encoding strategy that combines statistical features with semantic word embeddings, enriching the feature set and capturing deeper patterns in the data.

A range of machine learning models (SVM, Naïve Bayes, Random Forest, and KNN) are trained and tuned using GridSearchCV. In addition, a Stacking Classifier combines multiple models to boost predictive performance. Validation with Stratified K-Fold Cross-Validation supports the reliability of the results and keeps class distributions balanced across folds.

Our results show that structured feature encoding significantly improves classification accuracy over traditional methods. This work advances computational genomics by offering a robust approach to handling clinical text data, supporting more effective precision medicine initiatives.
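The core of the pipeline described above (TF-IDF, SVD, a stacked ensemble, GridSearchCV, and stratified folds) can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy corpus, the two class labels, the hyperparameter grid, the logistic-regression meta-learner, and the macro-F1 scoring are all hypothetical choices made for the example. Gaussian Naïve Bayes is used here because SVD output can be negative, which the multinomial variant does not accept.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy corpus: two made-up mutation classes, 12 descriptions each.
rng = np.random.default_rng(0)
SHARED = ["mutation", "variant", "patient", "tumor"]
CLASS_WORDS = {
    0: ["kinase", "activating", "oncogenic", "gain", "amplification"],
    1: ["truncating", "loss", "suppressor", "deletion", "frameshift"],
}


def make_description(label):
    """Draw 8 words from the class vocabulary plus the shared vocabulary."""
    return " ".join(rng.choice(CLASS_WORDS[label] + SHARED, size=8))


labels = [0] * 12 + [1] * 12
texts = [make_description(y) for y in labels]

# Base learners named in the abstract; all settings are illustrative.
base_models = [
    ("svm", SVC(kernel="linear")),
    ("nb", GaussianNB()),
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
]

# TF-IDF -> SVD -> stacked ensemble. Keeping every step inside one pipeline
# means the vectorizer and SVD are refit on each training fold only.
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svd", TruncatedSVD(random_state=0)),
    ("stack", StackingClassifier(
        estimators=base_models,
        final_estimator=LogisticRegression(),  # assumed meta-learner
        cv=2,  # small inner split because the toy corpus is tiny
    )),
])

# Stratified folds keep the class ratio the same in every split.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
grid = GridSearchCV(
    pipeline,
    param_grid={"svd__n_components": [2, 4]},  # hypothetical grid
    cv=cv,
    scoring="f1_macro",
)
grid.fit(texts, labels)

print(grid.best_params_, round(grid.best_score_, 3))
predictions = grid.predict(
    ["activating kinase mutation", "frameshift deletion variant"]
)
```

SMOTE and the word-embedding features are left out of this sketch. If SMOTE were added, it would need to resample only the training portion of each fold, for example by using the pipeline class from the imbalanced-learn package in place of scikit-learn's, so that synthetic samples never leak into validation data.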