Optimizing Llama 7B for Medical Question Answering: A Study on Fine-Tuning Strategies and Performance on the MultiMedQA Dataset

Abstract

This study explores the efficacy of various fine-tuning strategies on the performance of the Llama 7B model, a Large Language Model (LLM), when applied to medical question answering using the MultiMedQA dataset. Through systematic experimentation, we implemented and evaluated several fine-tuning techniques, including learning rate adjustments, gradual unfreezing, domain-specific vocabulary integration, selective layer fine-tuning, and regularization methods. Our findings show that these strategies significantly improve the model's accuracy, precision, recall, and F1 score, indicating a substantial enhancement in its ability to understand and respond to complex medical queries. These gains suggest that fine-tuned LLMs can strengthen medical information retrieval and support systems by providing more accurate, reliable, and contextually relevant answers. This research demonstrates the value of fine-tuning LLMs for specialized applications and lays the groundwork for future work on optimizing LLMs for a variety of domain-specific tasks. Our study contributes to the ongoing dialogue at the intersection of artificial intelligence and healthcare, highlighting the importance of targeted model optimization for advancing medical question answering.
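To illustrate how two of the listed techniques, selective layer fine-tuning and gradual unfreezing combined with staged learning rate adjustment, might be wired together for a Llama-style model, the sketch below uses the Hugging Face Transformers and PyTorch APIs. The checkpoint name, layer counts, learning rates, and weight decay value are illustrative assumptions, not settings reported in the paper.

```python
# Minimal sketch: selective layer fine-tuning with gradual unfreezing for a
# Llama-style causal LM. Checkpoint name and hyperparameters are assumptions.
import torch
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # assumed checkpoint; substitute your own

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)

# Start with every decoder block frozen (selective layer fine-tuning):
# only parameters outside the frozen blocks remain trainable.
for param in model.model.layers.parameters():
    param.requires_grad = False

def unfreeze_top_layers(model, n_layers):
    """Unfreeze the top n_layers decoder blocks."""
    for layer in model.model.layers[-n_layers:]:
        for param in layer.parameters():
            param.requires_grad = True

# Gradual unfreezing schedule: expose more layers at each stage while lowering
# the learning rate, with weight decay serving as a simple regularizer.
schedule = [(4, 2e-5), (8, 1e-5), (16, 5e-6)]  # (layers unfrozen, learning rate)

for n_layers, lr in schedule:
    unfreeze_top_layers(model, n_layers)
    optimizer = AdamW(
        (p for p in model.parameters() if p.requires_grad),
        lr=lr,
        weight_decay=0.01,
    )
    # ... run one training stage over MultiMedQA examples with this optimizer ...
```

The staged loop mirrors the intuition behind gradual unfreezing: higher layers, which are most task-specific, adapt first, while lower layers are released later under smaller learning rates to limit catastrophic forgetting of general language knowledge.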
