Evaluating Multimodal LLMs for Context-Aware Forensic Image Interpretation

Abstract

Digital forensics is vital for analyzing extensive image data from mobile devices to identify individuals and activities in investigations. Traditional methods struggle with complex real-world images, particularly distinguishing military personnel from military-themed mannequins. This study assesses multimodal Large Language Models (LLMs) - Google’s Gemini 1.5 Pro, the open-source LLAVA, and GPT-4o - for detecting military personnel in 434 mobile-device images depicting military personnel, mannequins, and civilians. The models achieved strong recall (0.99 for Gemini, 0.98 for LLAVA, and 0.91 for GPT-4o) but only moderate precision (0.69, 0.69, and 0.67, respectively), reflecting a notable rate of mannequin-induced false positives. Accuracy ranged from 0.793 for Gemini and LLAVA to 0.770 for GPT-4o, aligning with observed differences in contextual understanding. Contextual classification also posed challenges: Gemini achieved 0.787 accuracy for country identification, followed by GPT-4o (0.385) and LLAVA (0.121). Unit name recognition remained weak across all models. Misclassification of mannequins was the primary source of error, confirming that current multimodal models overemphasize uniform and equipment cues without verifying human authenticity. To enhance interpretability and reduce false positives, we integrated an agentic orchestration layer using CrewAI and LangGraph, which structured multimodal reasoning through dedicated sub-agents for provenance validation, perception, mannequin discrimination, and evidence-grounded attribution. These agentic frameworks substantially improved forensic reliability: CrewAI achieved 0.88 precision with a mannequin false-positive rate of 0.12, while LangGraph reached 0.90 precision and reduced false positives to 0.08. Country attribution accuracy rose to 0.58 and 0.62, respectively. Although recall decreased slightly due to conservative abstention logic (CrewAI 0.74, LangGraph 0.73), this trade-off yielded higher forensic confidence and reproducible, audit-ready decision traces. The results demonstrate that integrating agentic architectures transforms multimodal LLMs from opaque classifiers into transparent, evidence-driven forensic tools - enhancing both analytic precision and the evidentiary defensibility of AI-assisted investigations.
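To make the agentic layer concrete, the sketch below shows one hypothetical way the four sub-agent stages named above (provenance validation, perception, mannequin discrimination, and evidence-grounded attribution) could be wired as a LangGraph state machine. This is an illustrative assumption, not the authors' implementation: the state schema, node names, and stubbed logic are invented for exposition, and each stub stands in for a multimodal LLM call. An analogous CrewAI pipeline would express the same stages as Agent/Task pairs executed in a sequential Crew.

```python
# Minimal illustrative sketch (not the paper's code) of a four-stage
# forensic pipeline wired with LangGraph. All node logic is stubbed.
from typing import Optional, TypedDict

from langgraph.graph import END, START, StateGraph


class ForensicState(TypedDict):
    image_path: str
    provenance_ok: bool
    observations: list           # visual cues extracted by the perception stage
    is_mannequin: Optional[bool]
    verdict: str                  # "military_personnel" | "mannequin" | "abstain"
    country: Optional[str]


def validate_provenance(state: ForensicState) -> dict:
    # Stub: verify source/EXIF metadata before any visual reasoning.
    return {"provenance_ok": True}


def perceive(state: ForensicState) -> dict:
    # Stub: a multimodal LLM would enumerate uniforms, insignia, poses, etc.
    return {"observations": ["camouflage uniform", "rifle", "person-like figure"]}


def discriminate_mannequin(state: ForensicState) -> dict:
    # Stub: a dedicated human-authenticity check (skin texture, articulation,
    # pose rigidity) so uniform/equipment cues alone cannot force a match.
    return {"is_mannequin": False}


def attribute(state: ForensicState) -> dict:
    # Stub: evidence-grounded attribution with conservative abstention.
    if state["is_mannequin"]:
        return {"verdict": "mannequin", "country": None}
    if not state["observations"]:
        return {"verdict": "abstain", "country": None}
    return {"verdict": "military_personnel", "country": "unknown"}


graph = StateGraph(ForensicState)
graph.add_node("provenance", validate_provenance)
graph.add_node("perception", perceive)
graph.add_node("mannequin_check", discriminate_mannequin)
graph.add_node("attribution", attribute)

graph.add_edge(START, "provenance")
# End early (effectively abstain) if provenance fails; otherwise continue.
graph.add_conditional_edges(
    "provenance",
    lambda s: "perception" if s["provenance_ok"] else END,
)
graph.add_edge("perception", "mannequin_check")
graph.add_edge("mannequin_check", "attribution")
graph.add_edge("attribution", END)

pipeline = graph.compile()
result = pipeline.invoke({"image_path": "evidence_001.jpg"})
print(result["verdict"], result.get("country"))
```

Because every stage writes its conclusions into a shared state object, the compiled graph naturally yields the kind of reproducible, audit-ready decision trace the abstract describes: the final state records what each sub-agent observed and decided.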
