Breaking the Cost Barrier: How Quantization Enables Efficient Development and Deployment of LLMs for Public Healthcare
Abstract
The clinical promise of Large Language Models (LLMs) is often unrealized due to prohibitive computational costs. These costs create barriers not only to deployment in patient care but also to the vital process of fine-tuning models for specialized medical tasks and local patient populations. This study investigates 4-bit quantization as a methodology to make the entire clinical AI lifecycle, from development to implementation, both financially and practically viable. We performed a cost-benefit analysis using the Gemma 3 model family on the HealthQA-BR medical benchmark, comparing the diagnostic accuracy and computational resource requirements of standard full-precision models against their 4-bit quantized counterparts during both inference (clinical use) and QLoRA-based fine-tuning (model development). Quantization enabled substantial efficiency gains with a clinically negligible impact on performance. For the 12B-parameter model, we observed only a 1.3% absolute drop in accuracy; in exchange, computational requirements were reduced by 80% for fine-tuning and 69% for inference. This translates to a more than three-fold improvement in performance per unit of computational cost, accelerating research and development cycles. These results position 4-bit quantization as a pivotal enabling technology for clinical AI. By drastically lowering the resource barrier for model customization and deployment, it empowers medical institutions to rapidly develop and validate specialized AI tools on-site. This approach holds particular promise for large-scale public health systems such as Brazil's SUS and provides a viable blueprint for similar health systems worldwide to transform AI from a theoretical possibility into a practical and equitable reality in patient care.
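To make the methodology concrete, the sketch below shows how a 4-bit QLoRA setup of the kind described here is commonly expressed with the Hugging Face transformers, bitsandbytes, and peft libraries. This is a minimal illustration under stated assumptions, not the study's exact configuration: the checkpoint name, LoRA rank, and target modules are illustrative choices.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization with double quantization, as introduced by QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Assumed checkpoint name; the study uses the Gemma 3 model family.
model_id = "google/gemma-3-12b-it"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Train only small low-rank adapters on the attention projections;
# the quantized base weights stay frozen, which is what shrinks the
# fine-tuning memory footprint.
lora_config = LoraConfig(
    r=16,  # illustrative rank
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports the small trainable fraction

Because only the adapter matrices receive gradients while the base weights remain in 4-bit precision, the memory required for activations and optimizer states drops sharply, which is the mechanism behind the large fine-tuning savings reported above.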