Reimagining Efficiency in Vision-Language Models Through Low-Precision Training Across Modalities and Architectures
Abstract
As Vision-Language Models (VLMs) grow larger and more sophisticated, their demands on memory, computation, and energy rise accordingly, limiting their accessibility and sustainability. This has created an urgent need for more efficient training and inference techniques, among which low-precision training has emerged as a particularly promising paradigm. Low-precision training represents and computes model parameters, activations, and gradients in reduced-bit formats (e.g., 8-bit, 4-bit, or even binary) rather than the conventional 32-bit floating-point representation. It offers substantial reductions in memory footprint, bandwidth requirements, and computational cost, enabling faster training cycles, larger batch sizes, and more affordable hardware deployment. However, applying low-precision training to VLMs poses a unique set of challenges: these models combine visual and textual inputs and often rely on complex attention mechanisms and cross-modal fusion modules, components that are sensitive to numerical precision. Naive quantization can therefore lead to unstable training dynamics, misalignment between modalities, and substantial accuracy degradation.

This survey provides a comprehensive and detailed overview of the current landscape of low-precision training for large VLMs, synthesizing recent advances across algorithmic, architectural, and empirical dimensions. We begin with the mathematical formulation of quantized training, introducing the key concepts of discretization, straight-through estimators, and quantization-aware optimization. We then review a broad range of quantization techniques, including post-training quantization (PTQ), quantization-aware training (QAT), mixed-precision strategies, learned step size quantization (LSQ), and state-of-the-art post-hoc methods such as GPTQ and AWQ, analyzing each in terms of its applicability to VLM components such as vision encoders, language models, and multimodal fusion layers. We also introduce quantization-aware architectural designs, illustrated with schematic diagrams, that show how precision can be allocated strategically across model stages to balance efficiency and accuracy. Empirical evaluations are discussed extensively, highlighting how low-precision training affects performance on key vision-language tasks such as visual question answering (VQA), image-text retrieval, and image captioning. We compare accuracy retention, memory savings, and training throughput under different quantization regimes, and present case studies that reveal the design decisions behind successfully quantized VLMs such as BLIP-2, Flamingo, and MiniGPT.

Despite this progress, several open problems remain: the lack of theoretical tools to predict quantization sensitivity, the difficulty of stabilizing training in ultra-low-bit settings, the underdeveloped hardware-software ecosystem for low-precision training, and the inability of current evaluation metrics to capture multimodal fidelity and alignment under quantization. We discuss these challenges in depth and propose a roadmap for future research, emphasizing quantization-native architectures, robust training algorithms, hardware-aligned design, and new benchmarks that reflect the nuanced requirements of VLMs.
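To make the core mechanism concrete, the sketch below illustrates fake quantization with a straight-through estimator in PyTorch, the building block underlying the QAT and LSQ methods surveyed here. It is a minimal, illustrative example: the names FakeQuantizeSTE and quantize_weight, the per-tensor symmetric scale, and the bit widths are our own assumptions rather than the API of any specific paper or library.

```python
# Minimal sketch of symmetric fake quantization with a straight-through
# estimator (STE) in PyTorch. All names here are illustrative assumptions,
# not taken from any particular quantization library.
import torch


class FakeQuantizeSTE(torch.autograd.Function):
    """Quantize to a signed b-bit integer grid in the forward pass;
    pass gradients through unchanged (straight-through) in the backward pass."""

    @staticmethod
    def forward(ctx, x, scale, bits=8):
        qmax = 2 ** (bits - 1) - 1          # e.g. 127 for 8-bit signed
        q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
        return q * scale                     # dequantize back to float

    @staticmethod
    def backward(ctx, grad_output):
        # STE: treat round/clamp as identity for the incoming gradient.
        return grad_output, None, None


def quantize_weight(w, bits=8):
    """Per-tensor symmetric scale derived from the current weight range."""
    scale = w.abs().max() / (2 ** (bits - 1) - 1) + 1e-12
    return FakeQuantizeSTE.apply(w, scale, bits)


# Usage: quantize a linear layer's weights on the fly during training, so the
# optimizer still updates the full-precision "shadow" weights via the STE.
layer = torch.nn.Linear(16, 4)
x = torch.randn(2, 16)
y = torch.nn.functional.linear(x, quantize_weight(layer.weight, bits=4), layer.bias)
y.sum().backward()                           # gradients reach layer.weight through the STE
```

LSQ-style methods differ mainly in treating the scale as a learnable parameter with its own gradient, rather than deriving it from the running weight range as in this sketch.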
We argue that low-precision training is not merely a technical optimization, but a foundational shift that enables more scalable, sustainable, and inclusive multimodal AI. As VLMs continue to scale and move toward broader deployment in real-world applications—from interactive assistants to mobile devices and embedded systems—low-precision methods will be critical to ensuring that these systems are not only powerful, but also efficient and widely accessible. This survey aims to serve as both a reference and a call to action for researchers, practitioners, and system designers working at the forefront of efficient vision-language learning.