MoE at Scale: From Modular Design to Deployment in Large-Scale Machine Learning Systems

Abstract

Mixture of Experts (MoE) architectures have rapidly emerged as a foundational building block for scaling deep neural networks efficiently, enabling models with hundreds of billions of parameters to be trained and deployed with only a fraction of their total capacity active per input. By conditionally activating a sparse subset of expert modules, MoEs decouple model capacity from computation cost, offering an elegant and powerful framework for modular representation learning. This survey provides a comprehensive and systematic review of the MoE literature, from early formulations in ensemble learning and hierarchical mixtures of experts to the modern sparse MoEs powering large-scale language and vision models. We categorize MoE architectures along the dimensions of gating mechanisms, expert sparsity, hierarchical composition, and cross-domain generalization. Further, we examine core algorithmic components such as routing strategies, load balancing, training dynamics, expert specialization, and infrastructure-aware deployment. We explore their applications across natural language processing, computer vision, speech, and multi-modal learning, and highlight their impact on foundation model development. Despite their success, MoEs still pose open challenges in routing stability, interpretability, dynamic capacity allocation, and continual learning, which we discuss in depth alongside emerging research directions including federated MoEs, compositional generalization, and neuro-symbolic expert modules. We conclude by identifying trends that point toward MoEs as a central abstraction for building efficient, modular, and general-purpose AI systems. This survey serves as both a foundational reference and a forward-looking roadmap for researchers and practitioners seeking to understand and advance the state of Mixture of Experts.
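
To make the core idea concrete, the sketch below shows how conditional activation decouples capacity from compute: a gating network scores each token, only the top-k experts run on it, and their outputs are combined with the renormalized gate weights. This is a minimal illustration in PyTorch, not the construction of any specific model in the survey; the class name SparseMoE, the top-2 softmax router, and all hyperparameters are assumptions chosen for brevity, and production systems add load-balancing losses, capacity limits, and expert-parallel dispatch.

```python
# Minimal sketch of a sparsely gated MoE layer with top-k routing.
# Illustrative only: names and hyperparameters are assumptions, not
# taken from the survey or any particular production system.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network: scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Expert modules: independent feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.gate(x)                               # (tokens, experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)               # renormalize over top-k
        out = torch.zeros_like(x)
        # Only the selected experts run on each token, so per-token compute
        # scales with top_k rather than with the total number of experts.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# Usage: 16 tokens of dimension 64, each routed to its top-2 of 8 experts.
moe = SparseMoE()
y = moe(torch.randn(16, 64))
print(y.shape)  # torch.Size([16, 64])
```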
