Standardizing in-hospital cause-of-death data using large language models and neural machine translation
Abstract
This study aims to extend the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) by linking in-hospital mortality data with national mortality statistics. To this end, we developed a pipeline that uses large language models (LLMs) and neural machine translation (NMT) to standardize unstructured, bilingual cause-of-death text. From a cohort of 1,033,461 patients, 64,034 mortality records were identified. These records were converted into the standardized CDM format by mapping Korean Standard Classification of Diseases (KCD) codes to the international standard, Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). The pipeline structured 1,702 unstructured, bilingual (Korean–English) free-text records containing medical abbreviations. The English translation performance of NMT- and LLM-based models was compared and validated against a ground truth reviewed by medical professionals. Finally, we used retrieval-augmented generation (RAG)-based LLM prompts to automatically assign International Classification of Diseases, Tenth Revision (ICD-10) codes to the translated text, which identified neoplasms (35.89%) and circulatory diseases (17.70%) as the leading causes of death. Among the models evaluated, NLLB-200-1.3B achieved the highest translation accuracy, whereas SOLAR-10.7B and Llama 3.1-8B performed best in ICD-10 code matching. We established a comprehensive mortality database and show that LLM-based modules are feasible for standardizing large-scale clinical data, offering a scalable approach to improving the interoperability of real-world medical data.
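As a rough illustration of the retrieval-augmented prompting step, the sketch below retrieves candidate ICD-10 codes for a translated cause-of-death string and places them in a prompt for an LLM. The abstract does not describe the retriever, the prompt wording, or the code index, so everything here is an assumption: the four-entry `ICD10_SAMPLE` table, the `retrieve` and `build_prompt` helpers, and the token-overlap scoring (a simple stand-in for whatever retrieval method the pipeline actually uses) are hypothetical.

```python
import re
from collections import Counter

# Hypothetical mini index; a real pipeline would search the full ICD-10 catalogue.
ICD10_SAMPLE = {
    "C34": "Malignant neoplasm of bronchus and lung",
    "I21": "Acute myocardial infarction",
    "I63": "Cerebral infarction",
    "J18": "Pneumonia, organism unspecified",
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def retrieve(query, table, k=2):
    """Return up to k codes ranked by token overlap with the query.

    Token overlap is an assumed stand-in for the retriever, which the
    abstract does not specify. Codes with no overlap are dropped.
    """
    query_tokens = Counter(tokenize(query))
    scored = []
    for code, description in table.items():
        overlap = sum((query_tokens & Counter(tokenize(description))).values())
        scored.append((overlap, code))
    # Highest overlap first; ties broken alphabetically by code.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [code for overlap, code in scored[:k] if overlap > 0]


def build_prompt(query, table, k=2):
    """Assemble an LLM prompt that lists the retrieved candidate codes."""
    candidates = retrieve(query, table, k)
    listing = "\n".join(f"{code}: {table[code]}" for code in candidates)
    return (
        f"Cause of death: {query}\n"
        f"Candidate ICD-10 codes:\n{listing}\n"
        "Answer with the single best-matching code."
    )


if __name__ == "__main__":
    print(build_prompt("acute myocardial infarction", ICD10_SAMPLE))
```

Restricting the model to retrieved candidates is one common way to keep generated codes within the valid code set; whether the study's prompts follow this design is not stated in the abstract.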