Reimagining Model Efficiency in Generative AI Through Unified and Differentiable Quantization Approaches
Abstract
As generative artificial intelligence (GenAI) models, particularly large-scale autoregressive transformers, diffusion models, and multi-modal architectures, continue to grow in size and complexity, their immense computational and memory requirements pose substantial challenges to real-world deployment. Quantization, the process of reducing the numerical precision of model parameters, activations, or gradients, has emerged as a critical tool for mitigating these challenges by enabling significant reductions in model size, inference latency, and energy consumption. However, the quantization of generative models introduces a uniquely complex set of obstacles that distinguishes it from traditional applications in discriminative models. Unlike classifiers or object detectors, generative models must preserve semantic coherence, distributional fidelity, and high-dimensional output structure, all of which are highly sensitive to the perturbations introduced by low-precision representations. This review presents a comprehensive and technical examination of the current landscape of quantization in GenAI, spanning theoretical formulations, algorithmic advances, training strategies, hardware implications, and deployment scenarios. We begin by introducing the mathematical foundations of quantization, including uniform and non-uniform quantizers, rounding operations, scaling mechanisms, and optimization frameworks for minimizing quantization-induced distortion. We then survey a wide spectrum of quantization techniques applied to generative models, ranging from post-training quantization (PTQ) and quantization-aware training (QAT) to more advanced approaches such as learned codebooks, mixed-precision methods, and quantized attention mechanisms. We explore how these strategies are tailored to various generative tasks—text generation, image synthesis, speech modeling, and multi-modal understanding—and highlight the distinctive precision challenges posed by autoregressive decoding, cross-modal fusion, and latent variable modeling. Furthermore, we identify key limitations and failure modes, including instability during beam search, degradation of long-form generation, and inconsistencies between quantized and full-precision outputs. Through detailed analysis, we underscore the trade-offs between model efficiency and generative quality, and we discuss emerging solutions that aim to bridge this gap via adaptive quantization, quantization-friendly architectures, and hybrid numerical formats. The review also addresses the broader implications of quantization, including hardware-software co-design, evaluation metrics for quantized generative outputs, and fairness considerations in compressed model deployment. Finally, we outline a roadmap for future research, emphasizing the need for principled, scalable, and ethically responsible quantization methodologies that can support the growing demand for low-cost, high-performance generative AI across diverse platforms and applications. This work serves as both a technical resource and a strategic overview for researchers and practitioners seeking to harness quantization in the service of more efficient, accessible, and sustainable generative modeling.
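To make the uniform quantizer mentioned above concrete before the formal treatment in the body of the review, the sketch below illustrates per-tensor symmetric uniform quantization: a single scale maps floating-point weights to signed integers via rounding and clamping, and dequantization reconstructs an approximation whose mean-squared error is the kind of distortion that PTQ and QAT methods seek to minimize. The function names (`quantize_uniform`, `dequantize_uniform`) and the per-tensor symmetric scheme are illustrative assumptions for exposition, not the specific formulations developed later in this review.

```python
# Illustrative sketch (assumed, simplified): per-tensor symmetric uniform quantization.
import numpy as np

def quantize_uniform(x: np.ndarray, num_bits: int = 8):
    """Map a float tensor to signed integers in [-(2^(b-1)-1), 2^(b-1)-1] with one scale."""
    qmax = 2 ** (num_bits - 1) - 1                 # e.g. 127 for 8-bit
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / qmax if max_abs > 0 else 1.0  # per-tensor scale; avoid division by zero
    q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)  # int8 assumes num_bits <= 8
    return q, scale

def dequantize_uniform(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct a floating-point approximation of the original tensor."""
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    weights = np.random.randn(4, 4).astype(np.float32)  # stand-in for a weight matrix
    q, scale = quantize_uniform(weights, num_bits=8)
    recon = dequantize_uniform(q, scale)
    # Mean-squared quantization error: the distortion objective minimized by many PTQ/QAT methods.
    print("MSE:", float(np.mean((weights - recon) ** 2)))
```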