BitBIRCH-Lean: chemical space in the palm of your workstation
Abstract
We present BitBIRCH-Lean, a fast, memory-efficient implementation of the BitBIRCH algorithm, designed for high-throughput clustering of very large molecular libraries (up to billions of drug-like molecules) on typical workstations. BitBIRCH-Lean considerably improves on the original BitBIRCH implementation by incorporating dynamic types and bit-packed fingerprints inside the clustering tree. Most operations in BitBIRCH-Lean are performed efficiently on compressed data, and optional C++ extensions accelerate the bottleneck calculations, providing up to a 2× speedup. Benchmark tests against GPU-accelerated methods highlight BitBIRCH-Lean as an efficient alternative for processing vast numbers of molecules. We further demonstrate the versatility of this new package by showcasing a parallel, multi-round variant of the BitBIRCH algorithm that exploits the gains in efficiency to cluster hundreds of millions of molecules in minutes, with no loss in cluster quality. The code is freely available at https://github.com/mqcomplab/bblean.
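To illustrate what operating on bit-packed fingerprints can look like, the following is a minimal NumPy sketch, not the BitBIRCH-Lean implementation or its API. The function names (`pack_fingerprints`, `popcount`, `tanimoto_packed`) are hypothetical, the choice of Tanimoto similarity as the example operation is an assumption made for illustration, and the toy 10-bit fingerprints stand in for the much longer fingerprints used in practice. The sketch packs eight bits per byte and computes the similarity directly on the packed bytes, without unpacking them.

```python
import numpy as np

# Lookup table: number of set bits for every possible byte value (0..255).
POPCOUNT = np.array([bin(i).count("1") for i in range(256)], dtype=np.uint8)


def pack_fingerprints(fps):
    """Pack an (n_molecules, n_bits) array of 0/1 values into uint8 bytes.

    Hypothetical helper: each row is padded with zeros to a multiple of 8 bits.
    """
    return np.packbits(np.asarray(fps, dtype=np.uint8), axis=1)


def popcount(packed):
    """Count the set bits along the last axis of a packed uint8 array."""
    return POPCOUNT[packed].sum(axis=-1, dtype=np.int64)


def tanimoto_packed(a, b):
    """Tanimoto similarity computed on packed fingerprints.

    Assumption for this sketch: two all-zero fingerprints have similarity 1.0.
    """
    inter = popcount(a & b)  # bits set in both fingerprints
    union = popcount(a | b)  # bits set in at least one fingerprint
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)


# Toy data: two 10-bit fingerprints (illustrative only).
fps = np.array(
    [
        [1, 1, 0, 0, 1, 0, 0, 0, 1, 0],
        [1, 0, 0, 0, 1, 0, 0, 0, 1, 1],
    ],
    dtype=np.uint8,
)
packed = pack_fingerprints(fps)

print(fps.nbytes, packed.nbytes)               # 20 bytes unpacked, 4 bytes packed
print(tanimoto_packed(packed[0], packed[1]))   # 3 shared bits / 5 total bits = 0.6
```

Relative to storing one bit per `uint8` element, packing reduces memory by a factor of eight for fingerprint lengths that are multiples of 8. The per-byte lookup table is only one way to count bits; a compiled extension could replace it with hardware popcount instructions.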