Reimagining Efficiency in Vision-Language Models Through Low-Precision Training Across Modalities and Architectures
Abstract
As Vision-Language Models (VLMs) grow larger and more sophisticated, their demands on memory, computation, and energy rise accordingly, limiting their accessibility and sustainability. This has created an urgent need for more efficient training and inference techniques, among which low-precision training has emerged as a particularly promising paradigm. Low-precision training represents and computes model parameters, activations, and gradients in reduced-bit formats (e.g., 8-bit, 4-bit, or even binary) rather than the conventional 32-bit floating-point representation. It offers substantial reductions in memory footprint, bandwidth requirements, and computational cost, enabling faster training cycles, larger batch sizes, and more affordable hardware deployment. However, applying low-precision training to VLMs poses a unique set of challenges: these models combine visual and textual inputs and often rely on complex attention mechanisms and cross-modal fusion modules, components that are sensitive to numerical precision. Naive quantization can therefore lead to unstable training dynamics, misalignment between modalities, and substantial accuracy degradation.

This survey provides a comprehensive and detailed overview of the current landscape of low-precision training for large VLMs, synthesizing recent advances across algorithmic, architectural, and empirical dimensions. We begin with the mathematical formulation of quantized training, introducing the key concepts of discretization, straight-through estimators, and quantization-aware optimization. We then review a broad range of quantization techniques, including post-training quantization (PTQ), quantization-aware training (QAT), mixed-precision strategies, learned step size quantization (LSQ), and state-of-the-art post-hoc methods such as GPTQ and AWQ, analyzing each in terms of its applicability to VLM components such as vision encoders, language models, and multimodal fusion layers. We also introduce quantization-aware architectural designs, illustrated with schematic diagrams, that show how precision can be allocated strategically across model stages to balance efficiency and accuracy. Empirical evaluations are discussed extensively, highlighting how low-precision training affects performance on key vision-language tasks such as visual question answering (VQA), image-text retrieval, and image captioning. We compare accuracy retention, memory savings, and training throughput under different quantization regimes, and present case studies that reveal the design decisions behind successfully quantized VLMs such as BLIP-2, Flamingo, and MiniGPT.

Despite this progress, several open problems remain: the lack of theoretical tools to predict quantization sensitivity, the difficulty of stabilizing training in ultra-low-bit settings, the underdeveloped hardware-software ecosystem for low-precision training, and the inability of current evaluation metrics to capture multimodal fidelity and alignment under quantization. We discuss these challenges in depth and propose a roadmap for future research, emphasizing quantization-native architectures, robust training algorithms, hardware-aligned design, and new benchmarks that reflect the nuanced requirements of VLMs.
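To make the core mechanism concrete, the sketch below illustrates fake quantization with a straight-through estimator in PyTorch, the building block underlying the QAT and LSQ methods surveyed here. It is a minimal, illustrative example: the names FakeQuantizeSTE and quantize_weight, the per-tensor symmetric scale, and the bit widths are our own assumptions rather than the API of any specific paper or library.

```python
# Minimal sketch of symmetric fake quantization with a straight-through
# estimator (STE) in PyTorch. All names here are illustrative assumptions,
# not taken from any particular quantization library.
import torch


class FakeQuantizeSTE(torch.autograd.Function):
    """Quantize to a signed b-bit integer grid in the forward pass;
    pass gradients through unchanged (straight-through) in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale, bits=8):
        qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit signed
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
        return q * scale                     # dequantize back to float

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round/clamp as identity for the incoming gradient.
        return grad_output, None, None


def quantize_weight(w, bits=8):
    """Per-tensor symmetric scale derived from the current weight range."""
    scale = w.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    return FakeQuantizeSTE.apply(w, scale, bits)


# Usage: quantize a linear layer's weights on the fly during training, so the
# optimizer still updates the full-precision "shadow" weights via the STE.
layer = torch.nn.Linear(16, 4)
x = torch.randn(2, 16)
y = torch.nn.functional.linear(x, quantize_weight(layer.weight, bits=4), layer.bias)
y.sum().backward()                           # gradients reach layer.weight through the STE
```

LSQ-style methods differ mainly in treating the scale as a learnable parameter with its own gradient, rather than deriving it from the running weight range as in this sketch.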
We argue that low-precision training is not merely a technical optimization, but a foundational shift that enables more scalable, sustainable, and inclusive multimodal AI. As VLMs continue to scale and move toward broader deployment in real-world applications—from interactive assistants to mobile devices and embedded systems—low-precision methods will be critical to ensuring that these systems are not only powerful, but also efficient and widely accessible. This survey aims to serve as both a reference and a call to action for researchers, practitioners, and system designers working at the forefront of efficient vision-language learning.