Standardizing in-hospital cause-of-death data using large language models and neural machine translation

Ji Hyun Lee
Borim Ryu
Yu Kyeong Kim

Abstract

This study aims to expand the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) by linking in-hospital mortality data with national mortality statistics. Specifically, we developed a pipeline that uses large language models (LLMs) and neural machine translation (NMT) to standardize unstructured, bilingual cause-of-death text. From a cohort of 1,033,461 patients, 64,034 mortality records were identified. These records were converted into a standardized CDM format by mapping Korean Standard Classification of Diseases (KCD) codes to the international standard, Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT). The pipeline structured 1,702 unstructured, bilingual (Korean–English) free-text records containing medical abbreviations. The English translation performance of NMT- and LLM-based models was compared and validated against a ground truth reviewed by medical professionals. Finally, we used retrieval-augmented generation (RAG)-based LLM prompts to automatically assign International Classification of Diseases, Tenth Revision (ICD-10) codes to the translated text, identifying neoplasms (35.89%) and circulatory diseases (17.70%) as the leading causes of death. Among the models compared, NLLB-200-1.3B achieved the highest translation accuracy, whereas SOLAR-10.7B and Llama 3.1-8B performed best in ICD-10 code matching. We established a comprehensive mortality database and demonstrated that LLM-based modules are feasible for standardizing clinical big data, offering a scalable approach to improving the interoperability of real-world medical data.
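
The retrieval-augmented code-assignment step can be illustrated with a minimal sketch. The abstract does not describe the retriever, the prompt wording, or the code index the authors used, so everything below is an assumption made for illustration: the four-entry `ICD10_INDEX`, the token-overlap ranking (a stand-in for whatever retrieval the study used), and the function names `retrieve`, `build_prompt`, and `icd10_chapter` are all hypothetical. The sketch retrieves candidate codes for a translated cause-of-death phrase, builds a prompt that an LLM could answer, and groups a code into the two ICD-10 chapters named in the abstract.

```python
import re
from collections import Counter

# Hypothetical mini-index; a real system would index the full ICD-10 code list.
ICD10_INDEX = {
    "C34.9": "malignant neoplasm of bronchus or lung, unspecified",
    "I21.9": "acute myocardial infarction, unspecified",
    "J18.9": "pneumonia, unspecified",
    "C22.0": "liver cell carcinoma",
}

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())

def retrieve(query, k=2):
    """Rank index entries by token overlap with the query.

    Token overlap is an assumed stand-in for the study's retriever.
    Entries with zero overlap are dropped.
    """
    q = Counter(tokenize(query))
    scored = []
    for code, desc in ICD10_INDEX.items():
        overlap = sum((q & Counter(tokenize(desc))).values())
        scored.append((overlap, code))
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [code for overlap, code in scored[:k] if overlap > 0]

def build_prompt(cause_text, candidates):
    """Assemble an illustrative prompt listing the retrieved candidates."""
    lines = [f"- {c}: {ICD10_INDEX[c]}" for c in candidates]
    return (
        "Assign one ICD-10 code to the cause of death below.\n"
        f"Cause of death: {cause_text}\n"
        "Candidate codes:\n" + "\n".join(lines) + "\n"
        "Answer with the code only."
    )

def icd10_chapter(code):
    """Coarse chapter lookup: C00-D48 are neoplasms, I00-I99 circulatory."""
    letter, num = code[0], int(code[1:3])
    if letter == "C" or (letter == "D" and num <= 48):
        return "Neoplasms"
    if letter == "I":
        return "Diseases of the circulatory system"
    return "Other"

candidates = retrieve("acute myocardial infarction")
prompt = build_prompt("acute myocardial infarction", candidates)
```

The prompt string would then be passed to a language model of choice; that call is omitted here because the abstract does not specify how the models were served or prompted.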

Version published to 10.21203/rs.3.rs-8719534/v1 on Research Square
Mar 18, 2026

Related articles

Developing a scalable pipeline for data extraction from clinical letters through resource-efficient prompt engineering

This article has 14 authors:
1. Ariel Yuhan Ong
2. Quang Nguyen
3. Ishani Barai
4. Justin Engelmann
5. Fares Antaki
6. Mertcan Sevgi
7. David A Merle
8. Lie Ju
9. Eliot Dow
10. Yukun Zhou
11. Gregory Maniatopoulos
12. Yemisi Takwoingi
13. Alastair K Denniston
14. Pearse A Keane
This article has no evaluations. Latest version: Mar 10, 2026

A Next-Generation NLP Framework for Psychological Behavior Analysis Based on State-of-the-art Language Model

This article has 6 authors:
1. Mohit Kumar
2. Ashwani Kumar
3. Avinash Kumar Sharma
4. Nishant Gupta
5. Achyut Shankar
6. Gautam Kumar
This article has no evaluations. Latest version: Apr 6, 2026

Clinical Safety of Large Language Models in Oral Cancer–Related Patient Communication: A Longitudinal Study

This article has 2 authors:
1. Burcu Yeliz KOLLAYAN
2. Tuğba CEBECİ
This article has no evaluations. Latest version: Mar 16, 2026