Introduction of a Method to Choose the Number of Clusters of a Mixed Dataset under Kproto

Abstract

Identifying homogeneous subgroups within a dataset, particularly without prior knowledge of the number of such subgroups, is of great interest to clinicians or policy makers, as this knowledge helps them tailor intervention strategies. Clustering is traditionally used to identify these subgroups. Partition clustering has been used extensively because it is more efficient than other clustering methods, but it requires the number of clusters to be known in advance. While measures exist to estimate the number of clusters in a numeric dataset, no approaches exist in the literature to aid in selecting the number of clusters in a dataset that has both numeric and categorical variables. In this paper, we introduce and demonstrate a new method, based on the extensively studied Kproto clustering algorithm, to select the number of clusters of a mixed dataset so that the clusters are stable and the contribution of the categorical variables is maximized and stable.
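
For readers unfamiliar with Kproto, the following is a minimal sketch of the mixed dissimilarity that k-prototypes clustering is generally understood to use: squared Euclidean distance on the numeric variables plus a weight times the number of mismatches on the categorical variables. It is not the selection method proposed in the paper, which the abstract does not specify. The function names, the weight name `gamma`, and the example values are assumptions chosen for illustration.

```python
from typing import Sequence, Tuple

# A prototype is a pair: (numeric centre, categorical modes).
Prototype = Tuple[Sequence[float], Sequence[str]]


def kproto_dissimilarity(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    proto_num: Sequence[float],
    proto_cat: Sequence[str],
    gamma: float,
) -> float:
    """Squared Euclidean distance on numeric variables plus
    gamma times the count of categorical mismatches."""
    numeric_part = sum((a - b) ** 2 for a, b in zip(x_num, proto_num))
    categorical_part = sum(1 for a, b in zip(x_cat, proto_cat) if a != b)
    return numeric_part + gamma * categorical_part


def nearest_prototype(
    x_num: Sequence[float],
    x_cat: Sequence[str],
    prototypes: Sequence[Prototype],
    gamma: float,
) -> int:
    """Return the index of the prototype with the smallest dissimilarity."""
    distances = [
        kproto_dissimilarity(x_num, x_cat, p_num, p_cat, gamma)
        for p_num, p_cat in prototypes
    ]
    return distances.index(min(distances))


# Hypothetical prototypes: one numeric and one categorical variable each.
prototypes = [([0.0], ["a"]), ([10.0], ["b"])]

# With a small gamma the numeric variable dominates the assignment;
# with a large gamma the categorical variable dominates.
small_gamma_choice = nearest_prototype([9.0], ["a"], prototypes, gamma=1.0)
large_gamma_choice = nearest_prototype([9.0], ["a"], prototypes, gamma=100.0)
```

In this sketch the same observation is assigned to a different prototype depending on `gamma`, which illustrates why the contribution of the categorical variables matters when judging a clustering of mixed data.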
