Assessing GPT and DeepL for Terminology Translation in the Medical Domain: A Comparative Study on the Human Phenotype Ontology
Abstract
Background: This paper presents a comparative study of two state-of-the-art language models, OpenAI's GPT and DeepL, for terminology translation in the medical domain.

Methods: The study was conducted on the Human Phenotype Ontology (HPO), which is used in medical research and diagnosis. Medical experts assessed the performance of both models on a set of 120 translated HPO terms using a 4-point Likert scale (strongly agree = 1, agree = 2, disagree = 3, strongly disagree = 4). An independent reference translation from the HeTOP database was used to validate the quality of the translations.

Results: The average Likert rating for the 120 selected HPO terms was 1.29 for GPT-3.5 and 1.37 for DeepL, where lower values indicate stronger agreement. The comparison with HeTOP revealed a high degree of similarity between the machine translations and the reference translations.

Conclusions: The results indicate that both GPT and DeepL are effective at translating HPO terms from English to German. Statistical analysis revealed no significant difference in mean ratings between the two models, suggesting comparable translation quality. The study illustrates the potential of machine translation and also shows that the coverage of translated medical terminology is incomplete, which underscores its relevance for cross-lingual medical research. However, the evaluation methods need further refinement, and specific translation issues remain to be addressed.
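The rating comparison described above can be illustrated with a minimal Python sketch. It assumes each term receives one rating per model on the 1 to 4 scale and that ratings are paired by term. The abstract does not name the statistical test that was used, so the exact sign test here is only an illustrative stand-in, and the ratings are hypothetical toy values, not data from the study. The function names `mean_rating` and `sign_test_p` are likewise introduced only for this example.

```python
from math import comb
from statistics import mean

# Hypothetical toy ratings (NOT data from the study): one rating per term,
# 1 = strongly agree ... 4 = strongly disagree, same terms in the same order.
ratings_a = [1, 1, 2, 1, 3, 1, 2, 1]
ratings_b = [1, 2, 2, 1, 2, 2, 3, 1]


def mean_rating(ratings):
    """Average Likert score; lower is better on this scale."""
    return mean(ratings)


def sign_test_p(a, b):
    """Two-sided exact sign test on paired ratings, ignoring ties.

    Assumption: this test is a stand-in; the abstract does not say
    which test the authors applied.
    """
    wins_a = sum(x < y for x, y in zip(a, b))  # terms where model A is rated better
    wins_b = sum(x > y for x, y in zip(a, b))  # terms where model B is rated better
    n = wins_a + wins_b
    if n == 0:
        return 1.0  # all pairs tied: no evidence of a difference
    k = min(wins_a, wins_b)
    # Probability of k or fewer wins under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


print(mean_rating(ratings_a), mean_rating(ratings_b))
print(sign_test_p(ratings_a, ratings_b))
```

With ordinal Likert data, a paired non-parametric test of this kind avoids assuming that the distance between "agree" and "disagree" equals the distance between other adjacent categories; other tests would be equally defensible.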