Fine-tuned large language models for answering questions about full-text biomedical research studies

Abstract

Objectives: Few studies have explored the degree to which fine-tuning a large language model (LLM) can improve its ability to answer questions about a research study.

Methods: We created an instruction set comprising 250 studies of HIV drug resistance, with 16 questions per study and an answer plus explanation for each question. The questions included whether a study reported viral genetic sequences and the demographics and antiviral treatments of the persons from whom the sequences were obtained. We fine-tuned GPT-4o-mini (GPT-4o), Llama3.1-8B-Instruct (Llama3.1-8B), and Llama3.1-70B-Instruct (Llama3.1-70B) using quantized low-rank adaptation (QLoRA). We assessed the accuracy, precision, and recall of each base and fine-tuned model in answering questions on a test set comprising 120 different studies. Parametric and nonparametric tests were used to compare base models with one another, fine-tuned models with their respective base models, and fine-tuned models with one another.

Results: Before fine-tuning, GPT-4o significantly outperformed both Llama3.1-70B and Llama3.1-8B, reflecting its greater precision than Llama3.1-70B and its greater precision and recall than Llama3.1-8B; Llama3.1-70B and Llama3.1-8B did not differ in performance. After fine-tuning, GPT-4o and Llama3.1-70B, but not Llama3.1-8B, performed significantly better than their base models. GPT-4o's improvement reflected a mean 6% increase in precision and a 9% increase in recall; Llama3.1-70B's improvement reflected a 15% increase in precision. After fine-tuning, Llama3.1-70B significantly outperformed Llama3.1-8B but did not match the fine-tuned GPT-4o model, which displayed superior recall.

Conclusion: Fine-tuning GPT-4o and Llama3.1-70B, but not the smaller Llama3.1-8B, led to marked improvements in answering specific questions about research papers.
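To make the instruction-set design concrete, below is a minimal sketch of what one training record might look like, assuming a JSONL format with one question per record. The field names, identifiers, and question wording are illustrative assumptions, not the authors' schema.

```python
# Hypothetical instruction-set record; field names are assumptions,
# not the paper's actual schema.
import json

record = {
    "study_id": "example-001",               # hypothetical identifier
    "study_text": "<full text of the study>",  # placeholder for the paper's text
    "question": "Does the study report viral genetic sequences?",
    "answer": "Yes",
    "explanation": "Sequences are described in the methods section.",
}
print(json.dumps(record))
```

With 250 studies and 16 questions per study, a layout like this would yield 4,000 such records for supervised fine-tuning.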
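The abstract states that the models were fine-tuned with QLoRA. The following is a minimal sketch of that setup using the Hugging Face stack (transformers, peft, bitsandbytes); the checkpoint name, adapter rank, scaling factor, and target modules are illustrative assumptions, since the paper's exact configuration is not given here.

```python
# Minimal QLoRA setup sketch. Hyperparameters are assumptions for
# illustration, not the paper's reported configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

MODEL_ID = "meta-llama/Llama-3.1-8B-Instruct"  # assumed checkpoint name

# 4-bit NF4 quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, quantization_config=bnb_config, device_map="auto"
)

# Small trainable low-rank adapters attached to the attention projections;
# only these adapter weights are updated during fine-tuning.
lora_config = LoraConfig(
    r=16,                # adapter rank (assumption)
    lora_alpha=32,       # scaling factor (assumption)
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapters are a small fraction of the total
```

Training would then proceed over the instruction records with a standard supervised fine-tuning loop (e.g., trl's SFTTrainer); GPT-4o-mini, as a hosted model, would instead be fine-tuned through OpenAI's fine-tuning API rather than this local workflow.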
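The evaluation reports accuracy, precision, and recall over the test set of 120 studies. A hedged sketch of that scoring, under the assumption that each question's answer can be reduced to a binary label (e.g., "does the study report sequences?"), is shown below; the toy labels are not the paper's data.

```python
# Sketch of question-level scoring, assuming binary labels per question.
from sklearn.metrics import accuracy_score, precision_score, recall_score

def score_model(y_true: list[int], y_pred: list[int]) -> dict[str, float]:
    """Aggregate accuracy, precision, and recall over all question-level labels."""
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred, zero_division=0),
        "recall": recall_score(y_true, y_pred, zero_division=0),
    }

# Toy example (not the paper's data): reference vs. model answers
reference = [1, 0, 1, 1, 0, 1]
predicted = [1, 0, 0, 1, 0, 1]
print(score_model(reference, predicted))
```

Per-model scores computed this way could then feed the parametric and nonparametric comparisons described in the Methods.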
