Large-scale Clustering via Fast Splitting of a Sparse Representative Tree Based on Local Density

Abstract

Large-scale clustering remains an active yet challenging task in data mining and machine learning, where existing algorithms often struggle to balance efficiency, accuracy, and adaptability. This paper proposes a novel large-scale clustering framework with three key innovations: (1) Parameter-free cluster discovery: unlike conventional methods that require a predefined number of clusters, our algorithm autonomously identifies natural cluster structures through dynamic density-based splitting decisions. (2) Hybrid sampling-partitioning strategy: by integrating randomized sampling with K-means-based partitioning, we extract high-quality representative points that preserve data integrity with linear computational complexity. (3) Local density-driven MST segmentation: a minimum spanning tree (MST) constructed from the representatives is adaptively partitioned using a local density criterion, which dynamically disconnects weakly associated edges by comparing density peaks between adjacent representative points. Extensive experiments on synthetic and real-world data sets (up to 20 million samples) demonstrate the algorithm's superiority: it achieves higher clustering accuracy than state-of-the-art methods while reducing runtime. Notably, the framework exhibits remarkable robustness to sampling ratios and eliminates dependency on user-specified parameters, making it ideal for real-world applications with complex, arbitrary-shaped data distributions.
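The three stages named in the abstract (sampling plus K-means to obtain representatives, an MST over those representatives, and density-based edge removal) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the actual splitting criterion, so the rule below (cut an edge when the point count around its midpoint falls below a fraction of the smaller endpoint count) is an assumed stand-in, and the names `pick_representatives`, `ball_count`, `split_mst`, `cluster`, together with the parameters `n_reps`, `sample_ratio`, and `ratio`, are hypothetical. Unlike the described method, the sketch is neither parameter-free nor linear in the data size.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import cdist


def pick_representatives(X, n_reps, sample_ratio=0.2, n_iter=10, seed=0):
    """Draw a random sample of X and refine it with a few Lloyd (K-means) steps."""
    rng = np.random.default_rng(seed)
    n_sample = min(len(X), max(n_reps, int(sample_ratio * len(X))))
    sample = X[rng.choice(len(X), size=n_sample, replace=False)]
    reps = sample[rng.choice(n_sample, size=n_reps, replace=False)].copy()
    for _ in range(n_iter):
        nearest = cdist(sample, reps).argmin(axis=1)
        for k in range(n_reps):
            members = sample[nearest == k]
            if len(members) > 0:  # an empty cell keeps its previous centre
                reps[k] = members.mean(axis=0)
    return reps


def ball_count(X, centre, radius):
    """Density proxy (assumption): number of data points within `radius` of `centre`."""
    return int((np.linalg.norm(X - centre, axis=1) <= radius).sum())


def split_mst(X, reps, ratio=0.5):
    """Build an MST over the representatives and drop low-density edges.

    Assumed rule, not taken from the paper: keep edge (i, j) only if the
    density at its midpoint is at least `ratio` times the smaller endpoint
    density, with all three measured in balls of a quarter of the edge length.
    """
    mst = minimum_spanning_tree(cdist(reps, reps)).tocoo()
    rows, cols = [], []
    for i, j, length in zip(mst.row, mst.col, mst.data):
        radius = length / 4.0
        d_mid = ball_count(X, (reps[i] + reps[j]) / 2.0, radius)
        d_end = min(ball_count(X, reps[i], radius), ball_count(X, reps[j], radius))
        if d_mid >= ratio * d_end:
            rows.append(i)
            cols.append(j)
    m = len(reps)
    forest = csr_matrix(
        (np.ones(len(rows)), (np.array(rows, dtype=int), np.array(cols, dtype=int))),
        shape=(m, m),
    )
    # Each connected component of the remaining forest is one cluster.
    _, rep_labels = connected_components(forest, directed=False)
    return rep_labels


def cluster(X, n_reps=20, ratio=0.5, seed=0):
    """Label every point with the cluster of its nearest representative."""
    X = np.asarray(X, dtype=float)
    reps = pick_representatives(X, n_reps, seed=seed)
    rep_labels = split_mst(X, reps, ratio)
    return rep_labels[cdist(X, reps).argmin(axis=1)]
```

Calling `cluster(X)` on an `(n, d)` array returns one integer label per row. The number of clusters is not passed in; it falls out of how many MST edges survive the density test, which mirrors the idea of discovering the cluster count from splitting decisions. The brute-force density counts and the dense pairwise distance matrix are only suitable for small inputs; a spatial index would be needed at the scale the abstract reports.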
