An Empirical Evaluation of Low-Rank Adapted Vision–Language Models for Radiology Medical Image Captioning
Abstract
Rapidly growing medical imaging volumes have increased radiologists' workloads, creating a need for automated tools that can support interpretation and reduce reporting delays. Vision-language models (VLMs) can generate clinically relevant captions to accelerate report drafting, but their widely varying parameter scales require systematic evaluation of their clinical utility. This study evaluated fine-tuned VLMs on the Radiology Objects in Context version 2 (ROCOv2) dataset, which contains 116,635 images across multiple modalities. We compared four Large VLMs (LVLMs: LLaVA-Mistral-7B, LLaVA-Vicuna-7B, LLaVA-1.5-LLaMA-7B, IDEFICS-9B) against four Smaller VLMs (SVLMs: MoonDream2, Qwen 2-VL, Qwen-2.5, SmolVLM) and two fully fine-tuned baseline architectures (VisionGPT2 and CNN-Transformer). Low-Rank Adaptation (LoRA), applied to fewer than 1% of model parameters in selected modules, delivered the best performance among the adaptation strategies we evaluated, outperforming broader LoRA configurations. LLaVA-Mistral-7B achieved the highest performance (Relevance: 0.516, Factuality: 0.118), substantially exceeding the VisionGPT2 baseline (0.325, 0.028). Among SVLMs, MoonDream2 reached a relevance score of 0.466, surpassing LLaVA-1.5 (0.462) despite using approximately 74% fewer parameters. Models formed a distinct performance hierarchy, with LVLMs scoring 0.273–0.317 overall, SVLMs 0.188–0.279, and baselines 0.154–0.177. To investigate performance enhancement strategies, we prepended ResNet-50-predicted image modality labels at inference time for underperforming SVLMs. This intervention produced mixed results: SmolVLM improved marginally, Qwen-2.5 gained 6.4%, and Qwen 2-VL lost 21.6%. Our results provide quantitative guidance for VLM selection in medical imaging. Although model size strongly influences performance, the findings indicate that architectural design and lightweight adaptation can enable selected small models to achieve viable performance for resource-constrained screening scenarios.
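To illustrate the kind of lightweight adaptation described above, the sketch below shows how LoRA can be restricted to a small set of attention projections so that well under 1% of parameters are trainable. This is a minimal example using the Hugging Face transformers and peft libraries; the model checkpoint, rank, and target modules are illustrative assumptions, not the exact configuration used in the study.

```python
# Minimal LoRA sketch (assumed settings, not the authors' exact configuration).
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Load a LLaVA-style vision-language model (checkpoint name is an assumption).
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.float16
)

lora_config = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # restrict adaptation to selected projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
# Reports the trainable-parameter share; with a narrow target set it is
# typically well below 1% of the total parameter count.
model.print_trainable_parameters()
```

With this setup, only the injected low-rank matrices are updated during fine-tuning while the base model weights remain frozen, which is what keeps the adapted parameter fraction small.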