State-of-the-Art MPI Allreduce Implementations for Distributed Machine Learning: A Survey
Abstract
Efficient data communication is pivotal in distributed machine learning, where large datasets and complex models place heavy demands on both computation and inter-node data exchange. This survey examines the central role of MPI Allreduce, a collective communication operation, in improving the scalability and performance of distributed machine learning. We review traditional MPI libraries such as MPICH and Open MPI, which provide foundational support across diverse computing environments, as well as specialized collective communication libraries such as NVIDIA’s NCCL and Intel’s oneCCL, which are designed to optimize performance on specific hardware platforms. Through a series of case studies, we demonstrate the impact of optimized Allreduce implementations on training time and model accuracy in real-world applications such as large-scale image classification and natural language processing. We also discuss emerging trends, including algorithmic advances, hardware-specific optimizations, and the movement toward automated tuning and tighter integration with modern machine learning frameworks. This survey underscores the need for continued research and development in MPI Allreduce implementations to meet the evolving demands of distributed machine learning and to achieve efficient, scalable, and robust distributed training systems.
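To make the operation concrete, the sketch below shows a minimal, illustrative use of MPI_Allreduce in C to average a gradient buffer across ranks, as done in data-parallel training. It is not taken from any of the surveyed libraries; the buffer size, variable names, and the explicit averaging step are assumptions chosen for clarity.

```c
/* Minimal sketch: element-wise sum of a "gradient" buffer across all
 * ranks with MPI_Allreduce, followed by averaging. Build with mpicc
 * and run with mpirun. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Toy local gradient: each rank contributes its own rank id. */
    double local_grad[4]  = { rank, rank, rank, rank };
    double global_grad[4];

    /* Sum the buffers element-wise across all ranks; every rank
     * receives the full result (a reduce plus broadcast in one call). */
    MPI_Allreduce(local_grad, global_grad, 4, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    /* Divide by the number of ranks to obtain the averaged gradient. */
    for (int i = 0; i < 4; ++i)
        global_grad[i] /= size;

    if (rank == 0)
        printf("averaged gradient[0] = %f\n", global_grad[0]);

    MPI_Finalize();
    return 0;
}
```

The same call pattern underlies the optimized implementations discussed in this survey: libraries such as NCCL and oneCCL expose an analogous allreduce primitive but select ring, tree, or hardware-aware algorithms internally to reduce communication cost.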