Distributionally-Robust Gradient Routing: A Bilevel Sparse Optimization Problem for Compute-Aware Mixture-of-Experts Training



Abstract

Distributionally-Robust Gradient Routing (DRGR) is a bilevel sparse optimization framework for training compute-aware Mixture-of-Experts (MoE) models under domain uncertainty and strict resource budgets. DRGR jointly optimizes model parameters and sparse routing policies by minimizing the worst-case generalization loss over an f-divergence ambiguity set, while explicitly regularizing gradient traffic and enforcing per-token and per-batch compute constraints. We derive a convex-concave dual reformulation of the inner adversary that yields a stable low-dimensional optimization problem and closed-form adversarial weights, and we propose a proximal alternating minimization algorithm that combines group-sparse proximal updates with exact projection onto the budget constraints. To address bilevel sensitivity, we develop a Jacobian-free hypergradient estimator based on Hessian-vector products, implemented via conjugate gradient or a damped Neumann series. This estimator is amenable to distributed expert-parallel settings, and we prove that it guarantees descent on a natural merit function under controlled inexactness. We provide existence and stationarity guarantees, derive bounds linking routing sparsity and robustness radius to excess risk and communication cost, and propose an evaluation protocol that measures robustness to domain shift, token-level fairness, and FLOPs/latency trade-offs. Empirical studies on synthetic and realistic multi-domain benchmarks show that DRGR substantially reduces worst-case error and gradient congestion while respecting strict compute budgets. Code and benchmark scripts are provided to facilitate reproducibility and wider adoption.
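
As a rough illustration of two ingredients named in the abstract, the sketch below shows what closed-form adversarial weights and a group-sparse proximal update can look like in a simple special case. It is not the authors' implementation. It assumes a KL-divergence ambiguity set (one particular f-divergence) around a uniform reference distribution with a fixed dual variable, and a plain group-lasso penalty; the function names `kl_adversarial_weights` and `group_soft_threshold` and the parameters `temperature` and `tau` are hypothetical.

```python
import numpy as np


def kl_adversarial_weights(losses, temperature):
    """Worst-case sample weights for a KL ambiguity set (assumed special case).

    For a fixed dual variable `temperature` > 0 and a uniform reference
    distribution, the inner maximizer is a softmax of the per-sample losses:
    q_i proportional to exp(loss_i / temperature).
    """
    losses = np.asarray(losses, dtype=float)
    # Subtract the maximum before exponentiating for numerical stability.
    z = (losses - losses.max()) / temperature
    w = np.exp(z)
    return w / w.sum()


def group_soft_threshold(v, groups, tau):
    """Proximal operator of tau * sum_g ||v_g||_2 (group lasso).

    Each group is shrunk toward zero as a block; groups whose Euclidean
    norm is at most `tau` are set exactly to zero.
    """
    out = np.array(v, dtype=float)
    for idx in groups:
        norm = np.linalg.norm(out[idx])
        scale = max(0.0, 1.0 - tau / norm) if norm > 0 else 0.0
        out[idx] = scale * out[idx]
    return out


# Hypothetical usage: three per-domain losses and two routing-weight groups.
weights = kl_adversarial_weights([0.2, 1.0, 3.0], temperature=1.0)
sparse = group_soft_threshold([3.0, 4.0, 0.1, 0.1], groups=[[0, 1], [2, 3]], tau=1.0)
```

In this special case the adversary places more mass on high-loss samples as `temperature` decreases, and the proximal step removes whole groups at once, which is the kind of mechanism that could switch off entire expert routes.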
