An Empirical Evaluation of Low-Rank Adapted Vision–Language Models for Radiology Medical Image Captioning
Abstract
Rapidly growing medical imaging volumes have increased radiologists' workloads, creating a need for automated tools that can support interpretation and reduce reporting delays. Vision-language models (VLMs) can generate clinically relevant captions to accelerate report drafting, but their widely varying parameter scales require systematic evaluation of their clinical utility. This study evaluated fine-tuned VLMs on the Radiology Objects in Context version 2 (ROCOv2) dataset, which contains 116,635 images across multiple modalities. We compared four Large VLMs (LVLMs: LLaVA-Mistral-7B, LLaVA-Vicuna-7B, LLaVA-1.5-LLaMA-7B, IDEFICS-9B) against four Smaller VLMs (SVLMs: MoonDream2, Qwen 2-VL, Qwen-2.5, SmolVLM) and two fully fine-tuned baseline architectures (VisionGPT2 and CNN-Transformer). Low-Rank Adaptation (LoRA), applied to fewer than 1% of model parameters in selected modules, delivered the best performance among the adaptation strategies we evaluated, outperforming broader LoRA configurations. LLaVA-Mistral-7B achieved the highest performance (Relevance: 0.516, Factuality: 0.118), substantially exceeding the VisionGPT2 baseline (0.325, 0.028). Among SVLMs, MoonDream2 reached a relevance score of 0.466, surpassing LLaVA-1.5 (0.462) despite using approximately 74% fewer parameters. Models formed a distinct performance hierarchy, with LVLMs scoring 0.273–0.317 overall, SVLMs 0.188–0.279, and baselines 0.154–0.177. To investigate performance enhancement strategies, we prepended ResNet-50-predicted image modality labels at inference time for underperforming SVLMs. This intervention produced mixed results: SmolVLM improved marginally, Qwen-2.5 gained 6.4%, and Qwen 2-VL lost 21.6%. Our results provide quantitative guidance for VLM selection in medical imaging. Although model size strongly influences performance, the findings indicate that architectural design and lightweight adaptation can enable selected small models to achieve viable performance for resource-constrained screening scenarios.
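To illustrate the kind of lightweight adaptation described above, the sketch below shows how LoRA can be restricted to a small set of attention projections so that well under 1% of parameters are trainable. This is a minimal example using the Hugging Face transformers and peft libraries; the model checkpoint, rank, and target modules are illustrative assumptions, not the exact configuration used in the study.

```python
# Minimal LoRA sketch (assumed settings, not the authors' exact configuration).
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a LLaVA-style vision-language model (checkpoint name is an assumption).
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # restrict adaptation to selected projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Reports the trainable-parameter share; with a narrow target set it is
# typically well below 1% of the total parameter count.
model.print_trainable_parameters()
```

With this setup, only the injected low-rank matrices are updated during fine-tuning while the base model weights remain frozen, which is what keeps the adapted parameter fraction small.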