Route, Select, Activate: The Mechanics of Mixture of Experts
Abstract
Mixture of Experts (MoE) has emerged as a foundational architectural principle in deep learning, enabling the construction of models that are both highly expressive and computationally efficient. By activating only a sparse subset of specialized expert networks for each input, MoE architectures effectively decouple model capacity from compute cost, offering a scalable alternative to traditional monolithic models. This paradigm has seen a resurgence in recent years, with breakthroughs in large-scale language modeling, vision, speech, and multi-modal learning, culminating in some of the most powerful models to date, such as Switch Transformer, GLaM, and V-MoE. This survey presents a comprehensive and systematic overview of Mixture of Experts in the context of deep learning. We begin with a historical perspective on the origins of MoE, tracing its evolution from classical ensemble methods to modern, sparsely activated architectures. The core design choices of MoE systems—including expert sparsity, gating functions, routing strategies, and hierarchical extensions—are carefully dissected. We examine a wide array of training challenges, such as expert imbalance, routing instability, and optimization bottlenecks, along with proposed solutions, including load-balancing regularizers, top-$k$ routing approximations, expert dropout, and gradient routing relaxation. A central focus of this survey is the application landscape of MoE, spanning natural language processing, computer vision, speech and audio processing, multi-modal tasks, and continual learning. Across these domains, MoE has demonstrated compelling gains in scalability, generalization, and modular adaptability. In parallel, we explore the theoretical underpinnings of MoE, connecting it to concepts from ensemble learning, conditional computation, modularity, and universal function approximation. We also discuss emerging insights into MoE's generalization behavior, routing dynamics, and optimization landscape. Despite this promise, MoE remains an active and evolving field with many open questions. We conclude by identifying open challenges—ranging from routing robustness and interpretability to dynamic expert generation and deployment constraints—and outlining directions for future research. By synthesizing current knowledge and charting future paths, this survey aims to serve as a definitive resource for both researchers and practitioners seeking to understand and harness the potential of Mixture of Experts in deep learning.
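To make the route-select-activate mechanics described above concrete, the following is a minimal, illustrative sketch of a sparsely gated MoE layer with top-$k$ routing and a Switch-Transformer-style load-balancing auxiliary term. It is not code from the survey; the class and parameter names (SparseMoELayer, num_experts, top_k, aux_loss) and the choice of PyTorch are assumptions made here for illustration, and the auxiliary-loss coefficient is left to the caller.

```python
# Illustrative sketch (not from the survey): a sparsely gated MoE layer with
# top-k routing and a simple load-balancing auxiliary loss. All names here
# (SparseMoELayer, num_experts, top_k, aux_loss) are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoELayer(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # "Route": a linear gate scores every expert for each token.
        self.gate = nn.Linear(d_model, num_experts, bias=False)
        # Experts: independent feed-forward networks of identical shape.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, d_model)
        logits = self.gate(x)                                   # (tokens, experts)
        probs = F.softmax(logits, dim=-1)
        # "Select": keep only the top-k experts per token and renormalize their weights.
        topk_probs, topk_idx = probs.topk(self.top_k, dim=-1)   # each (tokens, k)
        topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)

        # "Activate": each expert processes only the tokens routed to it.
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            token_mask = (topk_idx == e).any(dim=-1)            # tokens routed to expert e
            if token_mask.any():
                weight = (topk_probs * (topk_idx == e)).sum(dim=-1)[token_mask]
                out[token_mask] += weight.unsqueeze(-1) * expert(x[token_mask])

        # Load-balancing regularizer: product of the fraction of tokens dispatched
        # to each expert and its mean gate probability, summed over experts
        # (a common Switch-Transformer-style auxiliary term; scaling coefficient
        # is left to the caller).
        dispatch_frac = F.one_hot(topk_idx, self.num_experts).float().sum(dim=(0, 1)) / (x.size(0) * self.top_k)
        mean_prob = probs.mean(dim=0)
        aux_loss = self.num_experts * torch.sum(dispatch_frac * mean_prob)
        return out, aux_loss


if __name__ == "__main__":
    layer = SparseMoELayer(d_model=16, d_hidden=32)
    tokens = torch.randn(10, 16)
    y, aux = layer(tokens)
    print(y.shape, aux.item())
```

The sketch highlights the decoupling the abstract refers to: total parameter count grows with num_experts, while per-token compute depends only on top_k, and the auxiliary loss discourages the expert imbalance discussed later in the survey.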