Learning with Fewer Bits Across Layers and Time in the Training of Foundation-Scale Transformers
Abstract
The unprecedented scale of contemporary foundation models has catalyzed a dramatic shift in both the capabilities and the computational demands of modern machine learning systems. While the performance benefits of large-scale architectures such as transformers are well-documented across a wide spectrum of domains—including natural language processing, computer vision, code synthesis, and multimodal reasoning—their resource consumption during training and deployment poses increasingly critical challenges. In response to these constraints, low-precision arithmetic has emerged not merely as a hardware optimization, but as a central algorithmic and architectural consideration for building scalable, sustainable, and accessible AI systems. In this work, we examine the frontier of low-precision training for large-scale neural networks, with a focus on how quantized representations, reduced numerical formats, and precision-aware optimizers interact with the unique demands of training foundation models. We explore how bit-level reductions in forward and backward computation affect convergence, stability, and generalization, particularly in the context of transformer-based architectures that dominate today’s state-of-the-art. Beyond empirical performance, we consider the theoretical and practical implications of quantized gradients, loss surface discretization, and the trade-offs introduced by aggressive precision constraints. Our analysis covers a broad range of methods, including mixed-precision training, dynamic loss scaling, 8-bit and 4-bit optimizer variants, quantization-aware initialization, and the role of master weights in mitigating numerical instability. We further discuss how precision can be dynamically allocated across layers and training phases, revealing new opportunities for adaptive learning systems that optimize both accuracy and efficiency. Finally, we address the broader system-level and ethical dimensions of low-precision training—ranging from hardware-software co-design and compiler-level integration to issues of robustness, fairness, and carbon footprint. By synthesizing these diverse threads, we argue that low-precision training represents a fundamental rethinking of the numerical foundations of deep learning, one that will be essential for the next generation of AI models that are not only larger and faster, but also more efficient, equitable, and environmentally viable.
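To make the mechanics concrete, the sketch below illustrates the kind of mixed-precision loop the abstract refers to: reduced-precision forward and backward passes around FP32 master weights, with dynamic loss scaling that backs off on overflow and grows again after a run of stable steps. This is a minimal PyTorch illustration under stated assumptions, not the paper's implementation; the toy TinyMLP model and the scaling hyperparameters (initial scale, growth and backoff factors, growth interval) are illustrative choices, not values taken from the paper.

```python
# Minimal sketch (not the paper's implementation): mixed-precision training with
# FP32 master weights and dynamic loss scaling. TinyMLP and all hyperparameter
# values below are illustrative assumptions.
import torch
import torch.nn as nn

class TinyMLP(nn.Module):
    """Toy stand-in for a transformer block, small enough to run on CPU or GPU."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, x):
        return self.net(x)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = TinyMLP().to(device)                      # parameters stay in FP32 ("master" weights)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# Dynamic loss-scaling state (common defaults, not prescribed by the paper).
loss_scale = 2.0 ** 16
growth_factor, backoff_factor, growth_interval = 2.0, 0.5, 200
steps_since_overflow = 0

for step in range(100):
    x = torch.randn(32, 64, device=device)
    y = torch.randn(32, 1, device=device)

    # Forward pass in reduced precision; FP16 autocast needs a GPU, so fall back to BF16 on CPU.
    amp_dtype = torch.float16 if device == "cuda" else torch.bfloat16
    with torch.autocast(device_type=device, dtype=amp_dtype):
        loss = nn.functional.mse_loss(model(x), y)

    optimizer.zero_grad(set_to_none=True)
    (loss * loss_scale).backward()                # scale the loss so small gradients survive FP16

    # Detect overflow before touching the FP32 master weights.
    found_inf = any(
        p.grad is not None and not torch.isfinite(p.grad).all()
        for p in model.parameters()
    )
    if found_inf:
        loss_scale *= backoff_factor              # overflow: shrink the scale and skip this step
        steps_since_overflow = 0
        continue

    for p in model.parameters():                  # unscale gradients back into FP32 range
        if p.grad is not None:
            p.grad.div_(loss_scale)
    optimizer.step()                              # the update is applied to FP32 master weights

    steps_since_overflow += 1
    if steps_since_overflow >= growth_interval:   # no recent overflows: try a larger scale
        loss_scale *= growth_factor
        steps_since_overflow = 0
```

In practice, PyTorch's torch.cuda.amp.GradScaler packages the same overflow detection and rescaling; the manual version is shown only to expose where the master weights and the scale factor enter the update.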