Learning with Fewer Bits Across Layers and Time in the Training of Foundation-Scale Transformers
Abstract
The unprecedented scale of contemporary foundation models has catalyzed a dramatic shift in both the capabilities and the computational demands of modern machine learning systems. While the performance benefits of large-scale architectures such as transformers are well-documented across a wide spectrum of domains—including natural language processing, computer vision, code synthesis, and multimodal reasoning—their resource consumption during training and deployment poses increasingly critical challenges. In response to these constraints, low-precision arithmetic has emerged not merely as a hardware optimization, but as a central algorithmic and architectural consideration for building scalable, sustainable, and accessible AI systems. In this work, we examine the frontier of low-precision training for large-scale neural networks, with a focus on how quantized representations, reduced numerical formats, and precision-aware optimizers interact with the unique demands of training foundation models. We explore how bit-level reductions in forward and backward computation affect convergence, stability, and generalization, particularly in the context of transformer-based architectures that dominate today’s state-of-the-art. Beyond empirical performance, we consider the theoretical and practical implications of quantized gradients, loss surface discretization, and the trade-offs introduced by aggressive precision constraints. Our analysis covers a broad range of methods, including mixed-precision training, dynamic loss scaling, 8-bit and 4-bit optimizer variants, quantization-aware initialization, and the role of master weights in mitigating numerical instability. We further discuss how precision can be dynamically allocated across layers and training phases, revealing new opportunities for adaptive learning systems that optimize both accuracy and efficiency. Finally, we address the broader system-level and ethical dimensions of low-precision training—ranging from hardware-software co-design and compiler-level integration to issues of robustness, fairness, and carbon footprint. By synthesizing these diverse threads, we argue that low-precision training represents a fundamental rethinking of the numerical foundations of deep learning, one that will be essential for the next generation of AI models that are not only larger and faster, but also more efficient, equitable, and environmentally viable.
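To make the mechanics concrete, the sketch below illustrates the kind of mixed-precision loop the abstract refers to: reduced-precision forward and backward passes around FP32 master weights, with dynamic loss scaling that backs off on overflow and grows again after a run of stable steps. This is a minimal PyTorch illustration under stated assumptions, not the paper's implementation; the toy TinyMLP model and the scaling hyperparameters (initial scale, growth and backoff factors, growth interval) are illustrative choices, not values taken from the paper.

```python
# Minimal sketch (not the paper's implementation): mixed-precision training with
# FP32 master weights and dynamic loss scaling. TinyMLP and all hyperparameter
# values below are illustrative assumptions.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Toy stand-in for a transformer block, small enough to run on CPU or GPU."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMLP().to(device)                      # parameters stay in FP32 ("master" weights)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Dynamic loss-scaling state (common defaults, not prescribed by the paper).
loss_scale = 2.0 ** 16
growth_factor, backoff_factor, growth_interval = 2.0, 0.5, 200
steps_since_overflow = 0

for step in range(100):
    x = torch.randn(32, 64, device=device)
    y = torch.randn(32, 1, device=device)

    # Forward pass in reduced precision; FP16 autocast needs a GPU, so fall back to BF16 on CPU.
    amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad(set_to_none=True)
    (loss * loss_scale).backward()                # scale the loss so small gradients survive FP16

    # Detect overflow before touching the FP32 master weights.
    found_inf = any(
        p.grad is not None and not torch.isfinite(p.grad).all()
        for p in model.parameters()
    )
    if found_inf:
        loss_scale *= backoff_factor              # overflow: shrink the scale and skip this step
        steps_since_overflow = 0
        continue

    for p in model.parameters():                  # unscale gradients back into FP32 range
        if p.grad is not None:
            p.grad.div_(loss_scale)
    optimizer.step()                              # the update is applied to FP32 master weights

    steps_since_overflow += 1
    if steps_since_overflow >= growth_interval:   # no recent overflows: try a larger scale
        loss_scale *= growth_factor
        steps_since_overflow = 0
```

In practice, PyTorch's torch.cuda.amp.GradScaler packages the same overflow detection and rescaling; the manual version is shown only to expose where the master weights and the scale factor enter the update.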