Image and Video Question Answering with Large Language Models: A Comprehensive Review
Abstract
Image Question Answering (IQA) and Video Question Answering (VQA) are pivotal tasks at the intersection of computer vision and natural language processing, aiming to enable machines to comprehend visual content and answer human questions in natural language. Historically, these fields advanced through specialized architectures and feature engineering, but such approaches remained limited in complex reasoning, open-ended generation, and handling real-world ambiguity. The advent of Large Language Models (LLMs) has fundamentally transformed the AI landscape, demonstrating unprecedented capabilities in language understanding, generation, and intricate reasoning. This survey provides a comprehensive review of the state of the art in IQA and VQA, focusing specifically on how LLMs are integrated and leveraged to push the boundaries of visual-linguistic intelligence. We delineate the foundational concepts of VQA/IQA and LLMs, categorize prominent architectural paradigms for their integration, scrutinize existing datasets, benchmarks, and evaluation metrics, and critically analyze current challenges and promising future directions. Our review highlights the transformative potential of LLM-enhanced visual QA systems in overcoming the limitations of traditional models, while also shedding light on emergent issues such as hallucination, computational cost, and the need for robust evaluation. This work aims to serve as a structured guide for researchers navigating this rapidly evolving domain, fostering further innovation at the confluence of vision, language, and artificial intelligence.