A clustering framework for skewed features with low true cluster separation

Muntazir Mehdi
Haydar Demirhan
Sona Taheri

Abstract

Features with considerably larger or smaller observations than the rest of the dataset, causing noticeable skewness in the feature distributions, are prevalent in practical applications. Traditional clustering methods often assume symmetric data, leading to poor performance with skewed features. This challenge is further complicated in datasets with low separation between true clusters in the feature space. These problems are encountered in a wide range of important practical areas, such as cell grouping, forest fires, maritime search and rescue, urbanization studies, and neuroimaging. Bayesian model-based clustering methods can accurately capture the skewness in the data and the centers of poorly separated true clusters. However, they are computationally inefficient due to their Bayesian nature. We propose a Bayesian model-based clustering framework to address these issues by utilizing the generalized multivariate log-gamma distribution with a Dirichlet process mixture. Comparative numerical experiments on 25 benchmark datasets with traditional and Bayesian model-based clustering algorithms demonstrate the superior performance of the proposed method, particularly for skewed datasets with low true cluster separability. The proposed approach, implemented in R, also shows better computational efficiency than its Bayesian alternatives. The code implementing our approach is provided to facilitate practical applications.
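To make the notion of a skewed feature concrete, the following minimal Python sketch compares the sample skewness of a log-gamma variable with that of a normal variable. It is an illustration only and not the authors' R implementation: it assumes a plain univariate log-gamma variable (the logarithm of a gamma variate) as a stand-in for the generalized multivariate log-gamma distribution named in the abstract, and the helper names `sample_skewness` and `log_gamma_sample`, the shape value, and the sample size are hypothetical choices.

```python
import math
import random


def sample_skewness(values):
    """Return the biased sample skewness: third central moment over std cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_gamma_sample(shape, size, rng):
    """Draw `size` values of log(G), where G ~ Gamma(shape, scale=1).

    Assumption: a univariate stand-in, not the multivariate model in the paper.
    """
    return [math.log(rng.gammavariate(shape, 1.0)) for _ in range(size)]


rng = random.Random(0)  # fixed seed so the run is reproducible

# A left-skewed feature (log of a gamma variate) and a symmetric one (normal).
skewed = log_gamma_sample(shape=2.0, size=5000, rng=rng)
symmetric = [rng.gauss(0.0, 1.0) for _ in range(5000)]

skew_log_gamma = sample_skewness(skewed)
skew_normal = sample_skewness(symmetric)

print(f"log-gamma skewness: {skew_log_gamma:.2f}")
print(f"normal skewness:    {skew_normal:.2f}")
```

The log-gamma sample has a long left tail, so its skewness is clearly negative, while the normal sample stays near zero. A clustering method that assumes symmetric components would have to account for that tail with extra or misplaced clusters.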

Version published to 10.21203/rs.3.rs-7787061/v1 on Research Square
Oct 7, 2025
