For clinical data extraction, QLoRA attains accuracy close to LoRA while requiring fewer computational resources

Abstract

Background

Large language models (LLMs) can accurately extract structured data from free text, yet fine-tuning them for specific clinical tasks is often compute- and memory-intensive. We examine whether Parameter-Efficient Fine-Tuning (PEFT), which updates only a small subset of the LLM's weights, preserves accuracy on quantized models while further reducing memory and graphics processing unit (GPU) requirements for hardware-limited teams.
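
As a minimal, illustrative sketch of the low-rank adaptation idea behind PEFT (written here in PyTorch; the class and hyperparameters are hypothetical, not the adapter implementation used in this study), the pretrained weight matrix is frozen and only a small pair of low-rank matrices is trained:

```python
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                           # pretrained weights stay frozen
        self.A = nn.Linear(base.in_features, r, bias=False)   # down-projection (d -> r)
        self.B = nn.Linear(r, base.out_features, bias=False)  # up-projection (r -> d)
        nn.init.zeros_(self.B.weight)                         # adapter starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # Output = frozen base projection + scaled low-rank correction B(A(x))
        return self.base(x) + self.scale * self.B(self.A(x))
```

Only the low-rank matrices A and B receive gradient updates, which is what keeps optimizer state and memory requirements small compared with full fine-tuning.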

Methods

We fine-tuned three Llama-3.1-8B-Instruct variants: (i) a non-quantized low-rank adaptation (LoRA) model and (ii–iii) quantized low-rank adaptation (QLoRA) models initialized from 8-bit and 4-bit quantized bases. We used the ELMTEX corpus of 60,000 clinical summaries extracted from PubMed Central, manually annotated for 15 categories. Models were evaluated with naïve and advanced prompting to extract data for these 15 categories; advanced prompting added a detailed task description and three examples selected by similarity scores. Metrics included ROUGE and BERTScore for lexical and semantic alignment, and entity-level precision, recall, and F1 to assess clinical concept extraction.
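
For concreteness, the sketch below shows the kind of 4-bit QLoRA setup this describes, using the Hugging Face transformers, peft, and bitsandbytes libraries. The model identifier, adapter rank, and target modules are illustrative assumptions rather than the study's reported configuration:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights (the 4-bit QLoRA variant);
# the 8-bit variant would use load_in_8bit=True instead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",   # assumed Hugging Face model id
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters on the attention projections; rank and alpha are placeholders.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

Loading the base model without quantization_config gives the plain LoRA variant; the adapters and training loop are otherwise identical.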

Results

Fine-tuning consistently outperformed prompting alone. LoRA improved metrics by 10–20 points over the base model, while QLoRA improved by 8–14 points, only 2–4 points below LoRA. Quantization reduced resource requirements: LoRA required 4 GPUs, versus 3 (8-bit) and 2 (4-bit) for QLoRA, and 4-bit QLoRA used about two-thirds of LoRA's peak GPU RAM. However, training the quantized models took 28–32% longer, likely due to dequantization overhead and less mature library routines.

Conclusion

PEFT on quantized models preserves most of LoRA's accuracy gains while substantially reducing GPU count and memory footprint, providing a practical path to accurate clinical information extraction in resource-constrained settings. This study was limited to a single architecture (Llama-3.1-8B) and to clinical summaries that are less complex than routine clinical notes, which constrains the generalizability of the results. Future work should test QLoRA across diverse model architectures and sizes and on clinical corpora representative of real-world practice.
