Semantic Encoding in Medical LLMs for Vocabulary Standardisation

Abstract

High-quality, standardised medical data availability remains a bottleneck for digital health and AI model development. A major hurdle is translating noisy free text into controlled clinical vocabularies to achieve harmonisation and interoperability, especially when source datasets are inconsistent or incomplete. We benchmark domain-specific encoder models against general LLMs for semantic-embedding retrieval using minimal vocabulary building blocks and test several prompting techniques. We also evaluate prompt augmentation with LLM-generated differential definitions. We test these prompts on open-source Llama and medically fine-tuned Llama models to steer their alignment toward accurate concept assignment across multiple prompt formats. Domain-tuned models consistently outperform general models of the same size in retrieval and generative tasks. However, performance is sensitive to prompt design and model size, and the benefits of adding LLM-generated context are inconsistent. While newer, larger foundation models are closing the gap, today’s lightweight open-source generative LLMs lack the stability and embedded clinical knowledge needed for deterministic vocabulary standardisation.
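
The semantic-embedding retrieval setup described above can be illustrated with a minimal sketch: encode the vocabulary's concept terms and the incoming free text with a sentence-embedding model, then assign the nearest concept by cosine similarity. This is not the paper's actual pipeline; the model name and the toy vocabulary below are illustrative placeholders.

```python
# Minimal sketch of embedding-based vocabulary standardisation.
# Assumptions: sentence-transformers is installed; the encoder and the
# three-concept "vocabulary" are stand-ins, not the benchmarked setup.
from sentence_transformers import SentenceTransformer, util

# Tiny stand-in for a controlled clinical vocabulary (concept_id, preferred term).
vocabulary = [
    ("C0011849", "Diabetes mellitus"),
    ("C0020538", "Hypertensive disease"),
    ("C0004096", "Asthma"),
]

# Any sentence-embedding encoder could be slotted in here; the abstract compares
# domain-tuned biomedical encoders against general-purpose models at this step.
model = SentenceTransformer("all-MiniLM-L6-v2")

concept_texts = [term for _, term in vocabulary]
concept_embeddings = model.encode(concept_texts, convert_to_tensor=True)

def standardise(free_text: str):
    """Map noisy free text to the closest vocabulary concept by cosine similarity."""
    query_embedding = model.encode(free_text, convert_to_tensor=True)
    scores = util.cos_sim(query_embedding, concept_embeddings)[0]
    best = int(scores.argmax())
    return vocabulary[best], float(scores[best])

# Example: a noisy clinical phrase mapped to its standardised concept.
print(standardise("pt has high blood pressure"))
```

The prompt-based variants in the abstract replace or augment this retrieval step with a generative LLM, for example by adding LLM-generated differential definitions of candidate concepts to the prompt before asking the model to pick one.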
