Unsupervised and Supervised Approaches for Breast Cancer Subtype Classification: Hierarchical Clustering and Machine Learning with Hyperparameter Optimization


Abstract

Breast cancer is a public health concern and comprises distinct subtypes, which makes accurate classification critical for personalized treatment. This study proposes a hybrid approach that applies unsupervised and supervised learning techniques to breast cancer subtype classification using gene expression data from The Cancer Genome Atlas (TCGA). First, hierarchical clustering with Pearson correlation and Euclidean distance as similarity metrics is employed to explore the intrinsic structure of the dataset. Subsequently, supervised machine learning models, including Logistic Regression, Support Vector Machine (SVM), Random Forest, and Multilayer Perceptron (MLP), are trained for the classification task. Hyperparameter tuning is performed using Optuna to improve predictive performance, and the SHapley Additive exPlanations (SHAP) library is applied to analyze the importance of variables and the influence of each dimension for each classifier. The results highlight the effectiveness of applying clustering methods and machine learning to improve classification accuracy and interpretability, contributing to the development of more accurate diagnostic tools and more personalized treatment strategies in breast cancer.
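The unsupervised step described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function name `cluster_samples`, the average-linkage choice, and the synthetic two-group matrix are all assumptions made for illustration, since the abstract names only the two similarity metrics. SciPy's `correlation` metric computes 1 minus the Pearson correlation, which is one common way to turn Pearson similarity into a distance.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist


def cluster_samples(expr, n_clusters, metric="correlation", method="average"):
    """Hierarchically cluster the rows (samples) of a samples x genes matrix.

    metric="correlation" uses 1 - Pearson r as the distance;
    metric="euclidean" uses ordinary Euclidean distance.
    The linkage method is an assumption; the abstract does not state one.
    """
    distances = pdist(expr, metric=metric)    # condensed pairwise distances
    tree = linkage(distances, method=method)  # agglomerative merge tree
    # Cut the tree so that at most n_clusters flat clusters remain.
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic stand-in for expression data (NOT TCGA): two groups of 10 samples,
# each group scattered around its own 50-gene profile.
rng = np.random.default_rng(0)
n_genes = 50
profile_a = rng.normal(size=n_genes)
profile_b = rng.normal(size=n_genes)
group_a = profile_a + 0.1 * rng.normal(size=(10, n_genes))
group_b = profile_b + 0.1 * rng.normal(size=(10, n_genes))
expr = np.vstack([group_a, group_b])

labels_pearson = cluster_samples(expr, n_clusters=2, metric="correlation")
labels_euclid = cluster_samples(expr, n_clusters=2, metric="euclidean")
print(labels_pearson)
print(labels_euclid)
```

On real expression data the two metrics can disagree: Pearson-based distance groups samples by the shape of their expression profile, while Euclidean distance is also sensitive to overall magnitude, which is presumably why the study examines both.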
