Benchmarking Multimodal Large Language Models for Forensic Science and Medicine: A Comprehensive Dataset and Evaluation Framework


Abstract

Background

Multimodal large language models (MLLMs) have demonstrated substantial progress in medical and legal domains in recent years; however, their capabilities in forensic science, a field at the intersection of complex medical reasoning and legal interpretation whose conclusions are subject to judicial scrutiny, remain largely unexplored. Forensic medicine uniquely depends on the accurate integration of often ambiguous textual and visual information, yet systematic evaluations of MLLMs in this setting are lacking.

Methods

We conducted a comprehensive benchmarking study of eleven state-of-the-art MLLMs, including proprietary (GPT-4o, Claude 4 Sonnet, Gemini 2.5 Flash) and open-source (Llama 4, Qwen 2.5-VL) models. Models were evaluated on 847 examination-style forensic questions drawn from academic literature, case studies, and clinical assessments, covering nine forensic subdomains. Both text-only and image-based questions were included. Model performance was assessed under direct and chain-of-thought prompting, with automated scoring verified through manual review.
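
A minimal sketch of the evaluation loop described above, assuming an OpenAI-compatible chat API; the question format, model name, prompt wording, scoring rule, and example item are illustrative assumptions rather than the authors' actual harness.

# Minimal, hypothetical evaluation sketch: direct vs. chain-of-thought prompting
# on a multiple-choice forensic question, with simple automated scoring.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    "direct": "Answer with the letter of the correct option only.",
    "chain_of_thought": "Reason step by step, then state the letter of the correct option on the final line.",
}

def ask(question: str, options: list[str], instruction: str) -> str:
    """Send one multiple-choice question under the given prompting style."""
    formatted = question + "\n" + "\n".join(
        f"{chr(65 + i)}. {opt}" for i, opt in enumerate(options)
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # any benchmarked model could be substituted here
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": formatted},
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

def score(reply: str, correct_letter: str) -> bool:
    """Automated scoring: check whether the final line names the correct option.
    In the study, automated scores were additionally verified by manual review."""
    return correct_letter in reply.splitlines()[-1].upper()

# Hypothetical example item
item = {
    "question": "Which post-mortem change typically appears first?",
    "options": ["Rigor mortis", "Livor mortis", "Putrefaction", "Adipocere"],
    "answer": "B",
}
for style, instruction in PROMPTS.items():
    reply = ask(item["question"], item["options"], instruction)
    print(style, "correct:", score(reply, item["answer"]))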

Results

Performance improved consistently with newer model generations. Chain-of-thought prompting improved accuracy on text-based and choice-based tasks for most models, though this gain did not extend to image-based and open-ended questions. Visual reasoning and complex inference tasks revealed persistent limitations, with models underperforming on image interpretation and nuanced forensic scenarios. Performance remained stable across forensic subdomains, suggesting that subject area alone did not drive variability.

Conclusions

MLLMs show emerging potential for forensic education and structured assessments, particularly for reinforcing factual knowledge. However, their limitations in visual reasoning, open-ended interpretation, and forensic judgment preclude independent application in live forensic practice. Future efforts should prioritize the development of multimodal forensic datasets, domain-targeted fine-tuning, and task-aware prompting to improve reliability and generalizability. These findings provide the first systematic baseline for MLLM performance in forensic science and inform pathways for their cautious integration into medico-legal workflows.
