Multiple Large Language Model Consensus for Object Detection: A Survey
Abstract
The rapid development of large language models (LLMs) and vision–language models (VLMs) has enabled instruction-driven visual understanding, where a single foundation model can recognize and localize arbitrary objects from natural-language prompts. However, predictions from individual models remain inconsistent: LLMs hallucinate nonexistent entities, while VLMs exhibit limited recall and unstable calibration compared with purpose-trained detectors. To address these limitations, a new paradigm termed Multiple Large Language Model Consensus (Multi-LLM Consensus) has emerged. In this approach, multiple heterogeneous LLMs or VLMs process a shared visual–textual instruction and generate independent structured outputs (bounding boxes and categories), which are then merged through consensus mechanisms. This cooperative inference improves both spatial accuracy and semantic correctness, making it particularly suitable for generating high-quality training datasets for efficient real-time object detectors. This survey provides a comprehensive overview of Multi-LLM Consensus for object detection. We formalize the concept, review related literature on ensemble reasoning and multimodal perception, and categorize existing methods into four frameworks: prompt-level, reasoning-to-detection, box-level, and hybrid consensus. We further analyze fusion algorithms, evaluation metrics, and benchmark datasets, highlighting their strengths and limitations. Finally, we discuss open challenges, including vocabulary alignment, uncertainty calibration, computational efficiency, and bias propagation, and identify emerging trends such as consensus-aware training, structured reasoning, and collaborative perception ecosystems.
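To make the box-level consensus idea concrete, the following is a minimal Python sketch, assuming each model emits (label, confidence, [x1, y1, x2, y2]) predictions. It clusters same-label boxes by IoU and fuses each cluster by confidence-weighted averaging, in the spirit of weighted boxes fusion; the function names, thresholds, and voting rule here are illustrative assumptions, not a method prescribed by any specific surveyed paper.

```python
# Minimal box-level consensus sketch (illustrative; not a canonical algorithm).
# Each model's output is a list of (label, confidence, [x1, y1, x2, y2]).
from collections import defaultdict

def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def consensus(model_outputs, iou_thr=0.55, min_votes=2):
    """Fuse per-model detections; keep boxes that >= min_votes models agree on."""
    by_label = defaultdict(list)
    for model_id, detections in enumerate(model_outputs):
        for label, conf, box in detections:
            by_label[label].append((model_id, conf, box))
    fused = []
    for label, dets in by_label.items():
        clusters = []  # each cluster: list of (model_id, conf, box)
        for det in sorted(dets, key=lambda d: -d[1]):  # high confidence first
            for cluster in clusters:
                if iou(cluster[0][2], det[2]) >= iou_thr:
                    cluster.append(det)  # same object, another model's vote
                    break
            else:
                clusters.append([det])  # start a new candidate object
        for cluster in clusters:
            voters = {m for m, _, _ in cluster}
            if len(voters) < min_votes:
                continue  # drop boxes only a single model claims (likely hallucination)
            total = sum(c for _, c, _ in cluster)
            box = [sum(c * b[i] for _, c, b in cluster) / total for i in range(4)]
            fused.append((label, total / len(model_outputs), box))
    return fused

# Example: two VLMs agree on a "dog"; a hallucinated "unicorn" is filtered out.
model_a = [("dog", 0.9, [10, 10, 50, 60]), ("unicorn", 0.8, [70, 70, 90, 90])]
model_b = [("dog", 0.7, [12, 11, 52, 58])]
print(consensus([model_a, model_b]))
```

Requiring agreement from at least two models (min_votes=2) is one simple way to suppress single-model hallucinations while retaining detections with cross-model support; real systems may instead weight votes by per-model calibration or learned reliability.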