k-Means NANI: an improved clustering algorithm for Molecular Dynamics simulations



Abstract

One of the key challenges of k-means clustering is the seed selection or the initial centroid estimation since the clustering result depends heavily on this choice. Alternatives such as k-means++ have mitigated this limitation by estimating the centroids using an empirical probability distribution. However, with high-dimensional and complex datasets such as those obtained from molecular simulation, k-means++ fails to partition the data in an optimal manner. Furthermore, stochastic elements in all flavors of k-means++ will lead to a lack of reproducibility. k-means N-Ary Natural Initiation (NANI) is presented as an alternative to tackle this challenge by using efficient n-ary comparisons to both identify high-density regions in the data and select a diverse set of initial conformations. Centroids generated from NANI are not only representative of the data and different from one another, helping k-means to partition the data accurately, but also deterministic, providing consistent cluster populations across replicates. From peptide and protein folding molecular simulations, NANI was able to create compact and well-separated clusters as well as accurately find the metastable states that agree with the literature. NANI can cluster diverse datasets and be used as a standalone tool or as part of our MDANCE clustering package.
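
To make the idea of deterministic, density-aware seeding concrete, the following is a minimal pure-Python sketch. It is not the NANI implementation and does not use the MDANCE package or n-ary comparisons. As an illustrative stand-in, it assumes a simple pairwise proxy for density (the squared distance to the m-th nearest neighbor) followed by farthest-point selection among the retained points. The names `deterministic_seeds`, `knn_score`, `kmeans`, `dense_fraction`, and `m` are hypothetical, and the toy 2D data is invented for the example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def knn_score(i, points, m):
    """Squared distance from point i to its m-th nearest neighbor.

    Smaller values mean a denser neighborhood (assumed density proxy).
    """
    d = sorted(sq_dist(points[i], q) for j, q in enumerate(points) if j != i)
    return d[min(m, len(d)) - 1] if d else 0.0


def deterministic_seeds(points, k, dense_fraction=0.9, m=3):
    """Pick k initial centroids without drawing any random numbers."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # Rank points from densest to sparsest; the index breaks ties.
    order = sorted(range(n), key=lambda i: (knn_score(i, points, m), i))
    # Keep only the densest fraction, but never fewer than k candidates.
    keep = order[:max(k, int(n * dense_fraction))]
    seeds = [keep[0]]
    while len(seeds) < k:
        # Add the retained point whose nearest chosen seed is farthest away.
        best = max(
            (i for i in keep if i not in seeds),
            key=lambda i: (min(sq_dist(points[i], points[s]) for s in seeds), -i),
        )
        seeds.append(best)
    return [points[i] for i in seeds]


def kmeans(points, seeds, n_iter=50):
    """Plain Lloyd iterations from the given seeds; returns (centroids, labels)."""
    centroids = [list(s) for s in seeds]
    labels = []
    for _ in range(n_iter):
        labels = [
            min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        new = []
        for c in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # An empty cluster keeps its previous centroid.
            new.append(
                [sum(col) / len(members) for col in zip(*members)]
                if members else centroids[c]
            )
        if new == centroids:
            break
        centroids = new
    return centroids, labels


# Toy data (invented): three tight groups plus two isolated outliers.
data = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1), (10.1, 0.1),
    (5.0, 20.0), (-8.0, -8.0),
]
seeds = deterministic_seeds(data, k=3)
centroids, labels = kmeans(data, seeds)
populations = sorted(labels.count(c) for c in range(3))
```

Because nothing in the sketch is random, repeated calls return the same seeds and therefore the same cluster populations, which is the reproducibility property the abstract highlights. The density filter exists so that the farthest-point step cannot select isolated outliers as seeds.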
