A clustering framework for skewed features with low true cluster separation

Abstract

Features with considerably larger or smaller observations than the rest of the dataset, causing noticeable skewness in the feature distributions, are prevalent in practical applications. Traditional clustering methods often assume symmetric data, leading to poor performance with skewed features. This challenge becomes further complicated in datasets with low separation between true clusters in the feature space. These problems are encountered in a wide range of important practical areas, such as cell grouping, forest fires, maritime search and rescue, urbanization studies, and neuroimaging. Bayesian model-based clustering methods can accurately capture the skewness in the data and centers of poorly separated true clusters. However, they are computationally inefficient due to their Bayesian nature. We propose a Bayesian model-based clustering framework to address these issues by utilizing the generalized multivariate log-gamma distribution with a Dirichlet process mixture. Comparative numerical experiments on 25 benchmark datasets with traditional and Bayesian model-based clustering algorithms demonstrate the superior performance of the proposed method, particularly for skewed datasets with low true cluster separability. The proposed approach, implemented in R, also shows better computational efficiency than its Bayesian alternatives. The computer codes to implement our approach are provided to facilitate practical applications.
