Evaluation of a Multimodal Custom Finetuned LLM for Virtual Healthcare Consultations
Abstract
We present a modular, privacy-conscious prototype of a multimodal agent with retrieval-augmented generation (RAG) serving as a virtual medical assistant for healthcare consultations. The system features a locally deployed LLaMA 3.2 11B model with 4-bit quantization, keeping the deployment compact and efficient. The model accepts both images and text directly and has been fine-tuned on 50,000 image-label pairs drawn from the MedTrinity dataset, which contains a wide variety of medical image-text pairs; the fine-tuning targets improved multimodal question answering in medical contexts. Text, image, and speech inputs are all supported, with speech transcribed via the AssemblyAI transcription API. For retrieval-augmented generation, ChromaDB semantically indexes and stores medical documents sourced from the MedQuAD dataset, comprising 41,000 medical question-answer pairs.

We evaluate the fine-tuned model against the base model, each with and without RAG support. Responses are scored with an LLM-as-a-judge protocol using OpenAI's GPT-4.1, applying both strict and non-strict evaluation criteria on the MMMU benchmark. From MMMU we select the Basic Medical Science, Clinical Medicine, and Diagnostic & Laboratory Medicine subjects, evaluating 30 questions per subject for each LLM variant, with and without RAG support.
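To make the retrieval step concrete, the sketch below shows how MedQuAD-style question-answer pairs could be indexed and queried with ChromaDB to build an augmented prompt. This is a minimal illustration, not the paper's released code: the collection name, persistence path, and sample record are hypothetical, and ChromaDB's default embedding function is assumed.

```python
# Minimal RAG retrieval sketch with ChromaDB (illustrative only).
import chromadb

# Hypothetical persistence path and collection name.
client = chromadb.PersistentClient(path="./medquad_index")
collection = client.get_or_create_collection(name="medquad")

# Index a MedQuAD-style question-answer pair (illustrative record, not real data).
collection.add(
    ids=["medquad-0001"],
    documents=["Q: What are the symptoms of anemia? A: Fatigue, pallor, shortness of breath."],
    metadatas=[{"source": "MedQuAD"}],
)

# Retrieve the top-k passages for a user query and assemble an augmented prompt.
query = "What causes iron-deficiency anemia?"
results = collection.query(query_texts=[query], n_results=3)
context = "\n\n".join(results["documents"][0])

prompt = (
    "You are a virtual medical assistant. Use the context below to answer.\n\n"
    f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
)
print(prompt)
```

In the full system described above, the resulting prompt (together with any attached image) would be passed to the locally deployed, 4-bit-quantized LLaMA 3.2 11B model rather than printed.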