Evaluating Locally Deployed Gemma 3 27B on the Taiwanese Pulmonologist’s Board Exam
Abstract
Background: Large language models (LLMs) with multimodal capabilities have recently achieved human-level performance on medical examinations. However, most rely on cloud-based processing, raising concerns about data privacy in clinical contexts. Locally deployable LLMs may offer a secure alternative.

Methods: We evaluated Google Gemma3-27B, a vision-capable LLM, on 1,200 multiple-choice questions (MCQs) from the 2013–2024 administrations of Taiwan’s pulmonary specialist board examination, obtained from the Taiwan Society of Pulmonary and Critical Care Medicine. The dataset included 1,156 text-only and 44 text-and-image MCQs, classified into 26 categories by two board-certified pulmonologists. The model was deployed locally on a laptop equipped with an AMD Ryzen 5 7535HS CPU, 32 GB RAM, and an NVIDIA RTX 4060 GPU.

Results: Gemma3-27B performed significantly above random guessing (binomial 95% CI for chance performance: 17–34/100). On the text-only MCQs, it correctly answered the following numbers of questions by year: 59, 59, 45, 53, 52, 54, 58, 45, 53, 50, 54, and 65. On the text-and-image MCQs, the scores were 0/1, 1/5, 2/4, 2/5, 0/3, 0/1, 0/1, 0/6, 3/4, 2/4, 3/7, and 0/3, where the denominator indicates the total number of such MCQs in each year. Six categories achieved accuracies above 60%, including chest surgery (68.2%), pneumothorax (65.2%), and respiratory pathophysiology (63.3%). However, major categories such as lung cancer, infection, and critical care scored below 60%, highlighting a mismatch between the domains with higher accuracy and those containing the most clinically relevant question sets.

Conclusion: Despite slower inference times and the need for manual answer extraction, Gemma3-27B demonstrated competitive accuracy and approached the benchmark passing threshold while maintaining data privacy. These results support the feasibility of locally deployed LLMs as privacy-preserving tools for high-stakes medical applications.
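The chance-level interval quoted above can be reproduced with an exact binomial calculation. The sketch below is a minimal illustration, assuming 100 questions per yearly exam and four answer options (chance accuracy p = 0.25); neither figure is stated explicitly in the abstract.

```python
from math import comb

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def binom_interval(n: int, p: float, level: float = 0.95) -> tuple[int, int]:
    """Equal-tailed central interval: the smallest k with CDF >= alpha/2
    and the smallest k with CDF >= 1 - alpha/2."""
    alpha = 1 - level
    lo = next(k for k in range(n + 1) if binom_cdf(k, n, p) >= alpha / 2)
    hi = next(k for k in range(n + 1) if binom_cdf(k, n, p) >= 1 - alpha / 2)
    return lo, hi

# Assumed parameters: 100 MCQs per exam, chance accuracy 0.25 (4 options).
lo, hi = binom_interval(100, 0.25)
print(f"95% chance-performance interval: {lo}-{hi} correct out of 100")
```

Under these assumptions the interval comes out close to the 17–34/100 range reported, so any yearly text-only score above the upper bound (as all twelve reported scores are) is inconsistent with random guessing.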
However, the domains with higher accuracy did not correspond to those containing the largest number of MCQs, many of which carry substantial clinical relevance. These clinically important domains should therefore be prioritized in future fine-tuning of locally deployed models.