Well-Connected Community Detection at Extreme Scale: Shared- and Distributed-Memory Parallel Algorithms
Abstract
Community detection algorithms such as Leiden frequently produce clusters that are internally disconnected or poorly connected, limiting their utility in downstream network analysis. The Well-Connected Clusters (WCC) and Connectivity Modifier (CM) algorithms address this by post-processing any input clustering to enforce a user-defined edge connectivity criterion through recursive minimum cut bisection. While prior work demonstrated shared-memory parallel implementations of WCC and CM in Chapel on graphs with up to two billion edges, scalability remains constrained by single-node memory capacity and the cost of graph loading and subgraph construction, which together account for over 86% of total runtime on billion-edge inputs.

This paper presents distributed-memory parallel implementations of WCC and CM in both C++ with MPI and Chapel with multi-locale execution. The central contribution is an architectural redesign that integrates subgraph generation into the Leiden clustering step, eliminating graph loading and subgraph construction from the WCC and CM pipeline entirely. Each compute node receives only its assigned subgraph files and executes a fully independent pipeline without ever loading the full graph. Connected component computation is parallelized within each node and distributed across nodes via round-robin assignment, and memory-mapped I/O accelerates file loading throughout.

Experiments on ten real-world networks spanning up to 2.1 billion edges show that the C++ distributed implementation achieves up to an order of magnitude speedup over the original baseline on graphs where both complete successfully. The Chapel distributed implementation is integrated into Arachne, an open-source graph analytics framework built on the Arkouda platform, available at https://github.com/Bears-R-Us/arkouda-njit.
It successfully processes the full benchmark suite, including graphs on which all other implementations fail, and delivers consistent 1.2×–2.1× speedups over the Chapel shared-memory reference. Failures on a subset of large graphs are traced to a known limitation in the VieCut minimum cut library and are the subject of ongoing work.