Consistent Performance of GPT-4o in Rare Disease Diagnosis Across Nine Languages and 4967 Cases
Listed in
This article is not in any list yet, why not save it to one of your lists.Abstract
Background
Large language models (LLMs) are increasingly used in the medical field for diverse applications including differential diagnostic support. The estimated training data used to create LLMs such as the Generative Pretrained Transformer (GPT) predominantly consist of English-language texts, but LLMs could be used across the globe to support diagnostics if language barriers could be overcome. Initial pilot studies on the utility of LLMs for differential diagnosis in languages other than English have shown promise, but a large-scale assessment on the relative performance of these models in a variety of European and non-European languages on a comprehensive corpus of challenging rare-disease cases is lacking.
Methods
We created 4967 clinical vignettes using structured data captured with Human Phenotype Ontology (HPO) terms with the Global Alliance for Genomics and Health (GA4GH) Phenopacket Schema. These clinical vignettes span a total of 378 distinct genetic diseases with 2618 associated phenotypic features. We used translations of the Human Phenotype Ontology together with language-specific templates to generate prompts in English, Chinese, Czech, Dutch, German, Italian, Japanese, Spanish, and Turkish. We applied GPT-4o, version gpt-4o-2024-08-06, to the task of delivering a ranked differential diagnosis using a zero-shot prompt. An ontology-based approach with the Mondo disease ontology was used to map synonyms and to map disease subtypes to clinical diagnoses in order to automate evaluation of LLM responses.
Findings
For English, GPT-4o placed the correct diagnosis at the first rank 19·8% and within the top-3 ranks 27·0% of the time. In comparison, for the eight non-English languages tested here the correct diagnosis was placed at rank 1 between 16·9% and 20·5%, within top-3 between 25·3% and 27·7% of cases.
Interpretation
The differential diagnostic performance of GPT-4o across a comprehensive corpus of rare-disease cases was consistent across the nine languages tested. This suggests that LLMs such as GPT-4o may have utility in non-English clinical settings.
Funding
NHGRI 5U24HG011449 and 5RM1HG010860. P.N.R. was supported by a Professorship of the Alexander von Humboldt Foundation; P.L. was supported by a National Grant (PMP21/00063 ONTOPREC-ISCIII, Fondos FEDER).