State-of-the-Art MPI Allreduce Implementations for Distributed Machine Learning: A Survey
Abstract
Efficient data communication is pivotal in distributed machine learning, where large datasets and complex models place heavy demands on both computation and inter-node data exchange. This survey examines the central role of MPI Allreduce, a collective communication operation, in improving the scalability and performance of distributed machine learning. We review traditional MPI libraries such as MPICH and Open MPI, which provide foundational support across diverse computing environments, as well as specialized collective communication libraries such as NVIDIA’s NCCL and Intel’s oneCCL, which are designed to optimize performance on specific hardware platforms. Through a series of case studies, we demonstrate the impact of optimized Allreduce implementations on training time and model accuracy in real-world applications such as large-scale image classification and natural language processing. We also discuss emerging trends, including algorithmic advances, hardware-specific optimizations, and the movement toward automated tuning and tighter integration with modern machine learning frameworks. This survey underscores the need for continued research and development in MPI Allreduce implementations to meet the evolving demands of distributed machine learning and to achieve efficient, scalable, and robust distributed training systems.
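To make the operation concrete, the sketch below shows a minimal, illustrative use of MPI_Allreduce in C to average a gradient buffer across ranks, as done in data-parallel training. It is not taken from any of the surveyed libraries; the buffer size, variable names, and the explicit averaging step are assumptions chosen for clarity.

```c
/* Minimal sketch: element-wise sum of a "gradient" buffer across all
 * ranks with MPI_Allreduce, followed by averaging. Build with mpicc
 * and run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Toy local gradient: each rank contributes its own rank id. */
    double local_grad[4]  = { rank, rank, rank, rank };
    double global_grad[4];

    /* Sum the buffers element-wise across all ranks; every rank
     * receives the full result (a reduce plus broadcast in one call). */
    MPI_Allreduce(local_grad, global_grad, 4, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    /* Divide by the number of ranks to obtain the averaged gradient. */
    for (int i = 0; i < 4; ++i)
        global_grad[i] /= size;

    if (rank == 0)
        printf("averaged gradient[0] = %f\n", global_grad[0]);

    MPI_Finalize();
    return 0;
}
```

The same call pattern underlies the optimized implementations discussed in this survey: libraries such as NCCL and oneCCL expose an analogous allreduce primitive but select ring, tree, or hardware-aware algorithms internally to reduce communication cost.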