Fine-tuned large language models for answering questions about full-text biomedical research studies
Abstract
Objectives: Few studies have explored the degree to which fine-tuning a large language model (LLM) can improve its ability to answer questions about a research study.

Methods: We created an instruction set comprising 250 studies of HIV drug resistance, 16 questions per study, and answers plus explanations for each question. The questions included whether a study reported viral genetic sequences and the demographics and antiviral treatments of the persons from whom sequences were obtained. We fine-tuned GPT-4o-mini (GPT-4o), Llama3.1-8B-Instruct (Llama3.1-8B), and Llama3.1-70B-Instruct (Llama3.1-70B) using a quantized low-rank adapter (QLoRA). We assessed the accuracy, precision, and recall of each base and fine-tuned model in answering questions on a test set comprising 120 different studies. Parametric and nonparametric tests were used to compare base models to one another, fine-tuned models to their respective base models, and fine-tuned models to one another.

Results: Prior to fine-tuning, GPT-4o displayed significantly greater performance than both Llama3.1-70B and Llama3.1-8B, reflecting its greater precision relative to Llama3.1-70B and its greater precision and recall relative to Llama3.1-8B; there was no difference in performance between Llama3.1-70B and Llama3.1-8B. After fine-tuning, GPT-4o and Llama3.1-70B, but not Llama3.1-8B, displayed significantly improved performance compared with their base models. The improved performance of GPT-4o resulted from a mean 6% increase in precision and a 9% increase in recall; the improved performance of Llama3.1-70B resulted from a 15% increase in precision. After fine-tuning, Llama3.1-70B significantly outperformed Llama3.1-8B but did not perform as well as the fine-tuned GPT-4o model, which displayed superior recall.

Conclusion: Fine-tuning GPT-4o and Llama3.1-70B, but not the smaller Llama3.1-8B, led to marked improvement in answering specific questions about research papers.
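The abstract does not include the fine-tuning code, so the following is a minimal sketch of how QLoRA fine-tuning of an open-weight model such as Llama3.1-8B-Instruct is typically set up, assuming the Hugging Face transformers, peft, and bitsandbytes libraries. The adapter hyperparameters (r, lora_alpha, target modules, dropout) are illustrative choices, not the authors' reported values.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Trainable low-rank adapters attached to the attention projections;
# only these small matrices are updated during fine-tuning.
lora_config = LoraConfig(
    r=16,                 # illustrative rank
    lora_alpha=32,        # illustrative scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

Training would then proceed with a standard supervised fine-tuning loop over the instruction set (question, answer, and explanation per study). Note that GPT-4o-mini would be fine-tuned through OpenAI's hosted fine-tuning service rather than a local setup like this one.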
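The abstract reports accuracy, precision, and recall and compares models with parametric and nonparametric tests without naming them; a paired t-test and a Wilcoxon signed-rank test are common choices for this kind of per-study comparison. Below is a minimal sketch of both steps; all numbers are simulated for illustration and are not the study's data.

```python
import numpy as np
from scipy.stats import ttest_rel, wilcoxon
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Toy gold labels vs. model answers for one yes/no question across studies
# (1 = "yes", 0 = "no"); values invented for illustration.
gold = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
pred = np.array([1, 0, 0, 1, 1, 1, 0, 0, 1, 0])

print(f"accuracy:  {accuracy_score(gold, pred):.2f}")
print(f"precision: {precision_score(gold, pred):.2f}")  # TP / (TP + FP)
print(f"recall:    {recall_score(gold, pred):.2f}")     # TP / (TP + FN)

# Simulated per-study accuracies for a base model and its fine-tuned
# counterpart over the same 120 test studies, compared with a paired
# parametric test and a nonparametric alternative.
rng = np.random.default_rng(seed=0)
base = rng.uniform(0.70, 0.90, size=120)
tuned = np.clip(base + rng.normal(0.05, 0.03, size=120), 0.0, 1.0)

print(ttest_rel(tuned, base))  # paired t-test
print(wilcoxon(tuned, base))   # Wilcoxon signed-rank test
```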