Speculative Decoding for Multimodal Models: A Survey
Abstract
Multimodal generative models have demonstrated remarkable capabilities in visual understanding, audio synthesis, and embodied control. These capabilities, however, come with substantial inference overhead due to autoregressive decoding or iterative generation processes, compounded by modality-specific challenges including extensive visual token redundancy, strict real-time latency requirements in robotic control, and prolonged sequential generation in text-to-image synthesis. Speculative decoding has emerged as a promising paradigm for accelerating inference without degrading output quality, yet existing surveys remain focused on text-only large language models. In this survey, we provide a systematic and comprehensive review of speculative decoding methods for multimodal models, spanning Vision--Language, Text-to-Image, Vision--Language--Action, Video--Language, Speech, and Diffusion models. We organize the literature into a unified taxonomy built on two primary axes, the draft generation stage and the verification and acceptance stage, complemented by an analysis of inference framework support. Through this taxonomy, we identify recurring design patterns, including token compression, target-informed transfer, and relaxed acceptance, and examine how successful techniques transfer across modalities. We further provide a systematic comparison of existing methods under both self-reported and standardized benchmarking settings. Finally, we discuss open challenges and outline future directions. We also maintain a GitHub repository collecting the papers covered in this survey at https://github.com/zyfzs0/Multimodal-Models-Speculative-Decoding-Survey, and will actively update it as new research emerges. We hope this survey serves as a valuable resource for researchers and practitioners working on accelerating multimodal inference.
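The draft generation and verification/acceptance stages that organize the taxonomy can be illustrated with a minimal sketch of the speculative decoding loop. The arithmetic "models" below are toy stand-ins (not real networks), and in practice the verification stage runs as a single parallel forward pass of the target model rather than the sequential check shown here:

```python
def draft_model(context):
    # Cheap draft proposal: a deterministic toy function of the context.
    return (sum(context) * 3 + 1) % 7

def target_model(context):
    # "Authoritative" target prediction; in this toy setup it disagrees
    # with the draft whenever the context sum is a multiple of 4.
    s = sum(context)
    return (s + 3) % 7 if s % 4 == 0 else (s * 3 + 1) % 7

def speculative_step(context, k=4):
    """One draft-then-verify step.

    Returns the accepted tokens: the longest draft prefix the target
    agrees with, plus the target's correction at the first mismatch.
    """
    # Draft stage: autoregressively propose k tokens with the cheap model.
    drafts, ctx = [], list(context)
    for _ in range(k):
        t = draft_model(ctx)
        drafts.append(t)
        ctx.append(t)

    # Verification stage: the target checks each drafted token in order
    # (sequential here for clarity; parallel in real implementations).
    accepted, ctx = [], list(context)
    for t in drafts:
        expected = target_model(ctx)
        if t == expected:        # match: accept the drafted token
            accepted.append(t)
            ctx.append(t)
        else:                    # mismatch: take the target's token, stop
            accepted.append(expected)
            break
    return accepted

print(speculative_step((1, 2)))  # all k drafts accepted
print(speculative_step((1, 3)))  # first draft rejected; target corrects
```

When every drafted token is accepted, one target pass yields k tokens instead of one, which is the source of the speed-up; a rejection falls back to the target's own prediction, so output quality is preserved.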