Unsupervised and Supervised Approaches for Breast Cancer Subtype Classification: Hierarchical Clustering and Machine Learning with Hyperparameter Optimization

Ana Beatriz Miranda Valentin
Glaucia Maria Bressan
Elisângela Lizzi

Abstract

Breast cancer is a major public health concern that comprises distinct subtypes, making accurate classification critical for personalized treatment. This study proposes a hybrid approach that combines unsupervised and supervised learning techniques for breast cancer subtype classification using gene expression data from The Cancer Genome Atlas (TCGA). First, hierarchical clustering with Pearson correlation and Euclidean distance as similarity metrics is employed to explore the intrinsic structure of the dataset. Subsequently, supervised machine learning models, including Logistic Regression, Support Vector Machine (SVM), Random Forest, and Multilayer Perceptron (MLP), are trained for the classification task. Hyperparameter tuning is performed using Optuna to improve predictive performance, and the SHapley Additive exPlanations (SHAP) library is applied to analyze the importance of variables and the influence of each dimension for each classifier. The results highlight the effectiveness of combining clustering methods and machine learning to improve classification accuracy and interpretability, contributing to the development of more accurate diagnostic tools and personalized treatment strategies in breast cancer.
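To illustrate why the choice between Pearson correlation and Euclidean distance matters for hierarchical clustering, the following is a minimal sketch in plain Python. It is not the authors' implementation: the toy expression profiles, the function names, and the use of average linkage are all assumptions made for illustration, and in practice a library routine such as SciPy's hierarchical clustering would normally be used instead.

```python
import math
from itertools import combinations


def pearson_distance(x, y):
    """1 - Pearson correlation: near 0 when two profiles have the same shape.

    Assumes neither profile is constant (otherwise the denominator is zero).
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (norm_x * norm_y)


def euclidean_distance(x, y):
    """Straight-line distance: small when two profiles have similar magnitudes."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def average_linkage(samples, distance, n_clusters):
    """Agglomerative clustering: repeatedly merge the two closest clusters.

    Average linkage is an assumption here; the abstract does not name a linkage.
    """
    clusters = [[i] for i in range(len(samples))]

    def cluster_distance(c1, c2):
        total = sum(distance(samples[i], samples[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        # combinations yields index pairs with a < b, so deleting b is safe.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Hypothetical expression profiles (rows = samples, columns = genes).
profiles = [
    [1.0, 2.0, 3.0, 4.0],      # sample 0: rising, low magnitude
    [10.0, 20.0, 30.0, 40.0],  # sample 1: rising, high magnitude
    [4.0, 3.0, 2.0, 1.0],      # sample 2: falling, low magnitude
    [40.0, 30.0, 20.0, 10.0],  # sample 3: falling, high magnitude
]

by_shape = average_linkage(profiles, pearson_distance, n_clusters=2)
by_magnitude = average_linkage(profiles, euclidean_distance, n_clusters=2)
print("Pearson:  ", by_shape)
print("Euclidean:", by_magnitude)
```

On this toy data the Pearson-based distance groups samples whose profiles rise or fall together regardless of scale, while Euclidean distance groups samples with similar absolute expression levels, so the two metrics can expose different structure in the same dataset.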

Version published to 10.21203/rs.3.rs-6779819/v1 on Research Square
Nov 18, 2025

Related articles

Pixels to Prognosis: ResNet50 Hyper-parameter Analysis for Predicting Benign vs. Malignant Breast Cancer from Biopsy Scans

This article has 1 author:
1. Naman Dhariwal
This article has no evaluations
Latest version Jan 27, 2026

Research on an Interpretable Grey Wolf Optimization-Based Ensemble Machine Learning Model for Identifying Heterogeneity of Bladder Cancer Based on Immunological Microenvironment

This article has 9 authors:
1. Honglin Guo
2. Qiuyue Song
3. Chengcheng Gao
4. Ke Chen
5. Yunhao Yang
6. Maoyang Qin
7. Pengyu Wang
8. Xin Chen
9. Yazhou Wu
This article has no evaluations
Latest version Dec 12, 2025

Smart Diagnosis: AI and ML Powered Breast Cancer Classification

This article has 2 authors:
1. Sagar Verma
2. Vaibhav Sabale
This article has no evaluations
Latest version Jan 28, 2026
