Best practices to cluster large molecular libraries

Abstract

BitBIRCH is a novel clustering algorithm that enables the analysis of extremely large molecular libraries; however, its performance can be hindered by an excessive number of singletons or by the formation of disproportionately large clusters. Here, we present a data-driven strategy to identify optimal BitBIRCH parameters that mitigate these limitations. Using the ChEMBL34 library as a case study (with additional datasets reported in the Supporting Information), we show that similarity thresholds between three and four standard deviations above the global mean provide a balanced trade-off between cluster count and medoid similarity. These values can be approximated efficiently with the iSIM and iSIM-sigma frameworks. For the branching factor, we recommend values as high as computationally feasible, since increasing it to 1024 substantially reduced the number of singletons. We further introduce an iterative re-clustering procedure in which the similarity threshold can be adjusted to merge related subclusters and singletons from the initial clustering, giving the user control over the extent of cluster fusion. This work provides practical guidelines to enhance the robustness and usability of BitBIRCH for large-scale molecular clustering.
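
The threshold rule above can be illustrated with a minimal sketch. It assumes that the "global mean" refers to the mean pairwise Tanimoto similarity of the library, and it estimates the mean and standard deviation by brute force on a random sample rather than with iSIM or iSIM-sigma, which are not reproduced here. The function names (`tanimoto`, `suggest_threshold`), the default of 3.5 standard deviations, the sample size, and the random toy fingerprints are all hypothetical choices made for illustration; none of this is the authors' implementation or the BitBIRCH API.

```python
import random
import statistics
from itertools import combinations


def tanimoto(a: int, b: int) -> float:
    """Tanimoto similarity of two fingerprints stored as Python ints (bit sets)."""
    union = bin(a | b).count("1")
    if union == 0:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return bin(a & b).count("1") / union


def suggest_threshold(fps, n_sigma=3.5, sample_size=500, seed=0):
    """Return (mean, sigma, threshold), where threshold = mean + n_sigma * sigma.

    Brute-force stand-in for an iSIM-style estimate: all pairwise similarities
    are computed on a random sample of at most `sample_size` fingerprints.
    Requires at least two fingerprints. The threshold is capped at 1.0.
    """
    rng = random.Random(seed)
    sample = fps if len(fps) <= sample_size else rng.sample(fps, sample_size)
    sims = [tanimoto(a, b) for a, b in combinations(sample, 2)]
    mean = statistics.fmean(sims)
    sigma = statistics.pstdev(sims)
    return mean, sigma, min(1.0, mean + n_sigma * sigma)


# Toy data: random 64-bit integers standing in for real molecular fingerprints.
rng = random.Random(42)
toy_fps = [rng.getrandbits(64) for _ in range(200)]
mean, sigma, threshold = suggest_threshold(toy_fps)
print(f"mean={mean:.3f} sigma={sigma:.3f} threshold={threshold:.3f}")
```

Under these assumptions, the returned threshold would be passed to the clustering step as its similarity cutoff, and `n_sigma` could be varied between 3 and 4 to explore the trade-off between cluster count and medoid similarity described above. The pairwise computation scales quadratically with the sample size, which is why a sample is used here.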
