Introduction of a Method to Choose the Number of Clusters of a Mixed Dataset under Kproto

Ahalya Sivathayalan
Kenneth Chu

Abstract

Identifying homogeneous subgroups within a dataset, particularly without prior knowledge of the number of such subgroups, is of great interest to clinicians or policy makers, because this knowledge helps them tailor intervention strategies. Clustering is traditionally used to identify these subgroups. Partition clustering has been used extensively because it is more efficient than other clustering methods, but it requires the number of clusters to be known in advance. While measures exist to estimate the number of clusters in a numeric dataset, no approaches exist in the literature to aid in selecting the number of clusters in a dataset that has both numeric and categorical variables. In this paper, we introduce and demonstrate a new method for selecting the number of clusters in a mixed dataset so that the clusters are stable and the contribution of the categorical variables is both maximized and stable, using the extensively studied Kproto clustering algorithm.

Version published to 10.21203/rs.3.rs-6795375/v1 on Research Square
Jun 3, 2025

Related articles

A Novel Approach to Population Mean Estimation Using Two Auxiliary Variables Under PPS Sampling

This article has 3 authors:
1. Housila P. Singh
2. Rajesh Tailor
3. Akanksha Agrawal
This article has no evaluations. Latest version: Jan 19, 2026

Classification of Bio-Data with Interval Dissimilarities: A Multidimensional Scaling Framework

This article has 4 authors:
1. Md. Anwarul Islam Bhuiyan
2. Sohana Jahan
3. Md. Babul Hasan
4. Md. Maruf Hossain
This article has no evaluations. Latest version: Jan 21, 2026

Exploring the Relationship Between Doctors Availability and Mortality in Italy: A Machine Learning Approach using Multivariate Regression Trees

This article has 4 authors:
1. Giulia Contu
2. Marco Ortu
3. Francesco Mola
4. Sara Pau
This article has no evaluations. Latest version: Jan 21, 2026
