Instruction Tuning and CoT Prompting for Contextual Medical QA with LLMs


Abstract

Large language models (LLMs) have shown great potential in medical question answering (MedQA), yet adapting them to biomedical reasoning remains challenging due to domain-specific complexity and limited supervision. In this work, we study how prompt design and lightweight fine-tuning affect the performance of open-source LLMs on PubMedQA, a benchmark for multiple-choice biomedical questions. We focus on two widely used prompting strategies, standard instruction prompts and Chain-of-Thought (CoT) prompts, and apply QLoRA for parameter-efficient instruction tuning. Across multiple model families and sizes, our experiments show that CoT prompting alone can improve reasoning in zero-shot settings, while instruction tuning significantly boosts accuracy. However, fine-tuning on CoT prompts does not universally enhance performance and may even degrade it for certain larger models. These findings suggest that reasoning-aware prompts are useful, but their benefits are model- and scale-dependent. Our study offers practical insights into combining prompt engineering with efficient fine-tuning for medical QA applications.
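
To make the setup described in the abstract concrete, the sketch below shows one way to combine the two ingredients: QLoRA-style parameter-efficient instruction tuning of an open-source LLM (4-bit quantization plus low-rank adapters via the Hugging Face transformers/peft/bitsandbytes stack) and a Chain-of-Thought prompt template alongside a standard instruction prompt for PubMedQA-style questions. The model name, adapter hyperparameters, and prompt wording are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch: QLoRA instruction tuning setup + prompt templates for PubMedQA.
# Assumed details: base model, LoRA hyperparameters, and prompt phrasing.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # assumed; the study spans several model families and sizes

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; only these weights are trained.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)


def standard_prompt(context: str, question: str) -> str:
    """Standard instruction prompt: ask for the answer directly."""
    return (
        "Answer the biomedical question with yes, no, or maybe.\n\n"
        f"Abstract: {context}\n"
        f"Question: {question}\n"
        "Answer:"
    )


def cot_prompt(context: str, question: str) -> str:
    """Chain-of-Thought prompt: elicit step-by-step reasoning before the answer."""
    return (
        "You are a biomedical expert. Read the abstract and answer the question "
        "with yes, no, or maybe.\n\n"
        f"Abstract: {context}\n"
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer."
    )
```

Either template can be used both for zero-shot evaluation and as the input side of the instruction-tuning data; the abstract's finding is that fine-tuning on the CoT-formatted prompts helps some models but can hurt certain larger ones.
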
