MoE at Scale: From Modular Design to Deployment in Large-Scale Machine Learning Systems

Abstract

Mixture of Experts (MoE) architectures have rapidly emerged as a foundational building block for scaling deep neural networks efficiently, enabling models with hundreds of billions of parameters to be trained and deployed with only a fraction of their total capacity active per input. By conditionally activating a sparse subset of expert modules, MoEs decouple model capacity from computation cost, offering an elegant and powerful framework for modular representation learning. This survey provides a comprehensive and systematic review of the MoE literature, from early formulations in ensemble learning and hierarchical mixtures of experts to the modern sparse MoEs powering large-scale language and vision models. We categorize MoE architectures along the dimensions of gating mechanisms, expert sparsity, hierarchical composition, and cross-domain generalization. Further, we examine core algorithmic components such as routing strategies, load balancing, training dynamics, expert specialization, and infrastructure-aware deployment. We explore their applications across natural language processing, computer vision, speech, and multi-modal learning, and highlight their impact on foundation model development. Despite their success, MoEs still pose open challenges in routing stability, interpretability, dynamic capacity allocation, and continual learning, which we discuss in depth alongside emerging research directions including federated MoEs, compositional generalization, and neuro-symbolic expert modules. We conclude by identifying trends that point toward MoEs as a central abstraction for building efficient, modular, and general-purpose AI systems. This survey serves as both a foundational reference and a forward-looking roadmap for researchers and practitioners seeking to understand and advance the state of Mixture of Experts.
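
To make the core idea concrete, the sketch below shows how conditional activation decouples capacity from compute: a gating network scores each token, only the top-k experts run on it, and their outputs are combined with the renormalized gate weights. This is a minimal illustration in PyTorch, not the construction of any specific model in the survey; the class name SparseMoE, the top-2 softmax router, and all hyperparameters are assumptions chosen for brevity, and production systems add load-balancing losses, capacity limits, and expert-parallel dispatch.

```python
# Minimal sketch of a sparsely gated MoE layer with top-k routing.
# Illustrative only: names and hyperparameters are assumptions, not
# taken from the survey or any particular production system.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Gating network: scores each token against every expert.
        self.gate = nn.Linear(d_model, num_experts)
        # Expert modules: independent feed-forward networks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.ReLU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.gate(x)                               # (tokens, experts)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(top_vals, dim=-1)               # renormalize over top-k
        out = torch.zeros_like(x)
        # Only the selected experts run on each token, so per-token compute
        # scales with top_k rather than with the total number of experts.
        for e, expert in enumerate(self.experts):
            token_ids, slot = (top_idx == e).nonzero(as_tuple=True)
            if token_ids.numel() == 0:
                continue
            out[token_ids] += weights[token_ids, slot].unsqueeze(-1) * expert(x[token_ids])
        return out

# Usage: 16 tokens of dimension 64, each routed to its top-2 of 8 experts.
moe = SparseMoE()
y = moe(torch.randn(16, 64))
print(y.shape)  # torch.Size([16, 64])
```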
