Large Language Models Struggle to Encode Medical Concepts — A Multilingual Benchmarking and Comparative Analysis
Abstract
Interoperability in health information systems is crucial for accurate data exchange across environments such as electronic health records, clinical notes, and medical research. The main challenge arises from the wide variation in biomedical concepts, their differing representations across systems and languages, and the limited context available, all of which complicate data integration and standardization. Inspired by recent advances in large language models (LLMs), this study explores their potential role as biomedical knowledge engineers to (semi-)automate multilingual biomedical concept normalization, a key task for the semantic interoperability of medical concepts. We developed a novel multilingual dataset comprising 59,104 unique terms mapped to 27,280 distinct biomedical concepts, designed to assess language model performance on this task in five European languages: English, French, German, Spanish, and Turkish. We then proposed a multi-stage pipeline based on a retrieve-then-rerank approach that combines sparse and dense retrievers, rerankers, and fusion approaches, leveraging discriminative and generative LLMs, with a predefined primary knowledge organization system. Our experiments show that the best discriminative model, e5, achieves an accuracy of 71%, surpassing the best generative model, Mistral, by 2 percentage points (p-value < 0.001). For semi-automated workflows, e5 maintained superior performance, with 82% recall@10 versus Mistral’s 78%. Our findings show how LLM-based approaches can advance the normalization of multilingual biomedical terms, and they also expose the limitations of LLMs in encoding biomedical concepts.
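To make the evaluation terms in the abstract concrete, the following minimal sketch fuses two candidate rankings for a term and then scores them with accuracy (top-1) and recall@k. It is an illustration under stated assumptions, not the study's implementation: the abstract does not say which fusion method was used, so reciprocal rank fusion (RRF) with the conventional constant k=60 is assumed here, and the concept IDs (`C1` to `C4`), the example terms, and the function names are all hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of concept IDs into one ranked list.

    Assumption: RRF stands in for the unspecified fusion step of the pipeline.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, concept_id in enumerate(ranking, start=1):
            scores[concept_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by concept ID for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


def recall_at_k(predictions, gold, k):
    """Fraction of terms whose gold concept appears among the top-k candidates."""
    hits = sum(1 for term, ranked in predictions.items() if gold[term] in ranked[:k])
    return hits / len(predictions)


# Hypothetical toy candidates from a sparse and a dense retriever.
sparse = {
    "heart attack": ["C2", "C1", "C3"],
    "crise cardiaque": ["C3", "C4", "C1"],
}
dense = {
    "heart attack": ["C1", "C3", "C2"],
    "crise cardiaque": ["C4", "C1", "C3"],
}
gold = {"heart attack": "C1", "crise cardiaque": "C1"}

fused = {term: reciprocal_rank_fusion([sparse[term], dense[term]]) for term in sparse}

accuracy = recall_at_k(fused, gold, k=1)   # accuracy is recall at rank 1
recall_3 = recall_at_k(fused, gold, k=3)
print(f"accuracy={accuracy:.2f} recall@3={recall_3:.2f}")
```

In this toy setup, accuracy counts only terms whose correct concept is ranked first, while recall@k credits any term whose correct concept appears in the top k. That distinction is why recall@10 is the relevant figure for a semi-automated workflow, where a human picks from a short candidate list.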