Assessing Large Language Model Performance Related to Aging in Genetic Conditions

Abstract

Unlike some health conditions that have been extensively delineated throughout the lifespan, many genetic conditions are described largely in pediatric populations, with a focus on early manifestations such as congenital anomalies and developmental delay. A gap exists in understanding clinical features and optimal management as patients age. Generative artificial intelligence, including large language models (LLMs), is transforming biomedical disciplines. Motivated by these advances, we explored how LLMs handle age with respect to 282 genetic conditions selected based on prevalence. We divided these conditions into five categories: disorders limited to childhood; disorders limited to adulthood; disorders with changes in presentation across ages; disorders with changes in management across ages; and disorders with no changes across ages. We evaluated the ability of Llama-2-70b-chat (70b) and GPT-3.5 (GPT) to generate accurate medical vignettes for these conditions, graded by three clinicians on correctness, completeness, and conciseness. Using accurately generated vignettes as in-context prompts, we further generated and evaluated patient-geneticist dialogues and assessed LLM performance in answering specific questions about age-based management plans for a subset of conditions. Results revealed impressive performance by 70b with in-context prompting and by GPT in generating vignettes. Overall, we did not observe age-based biases, though our experiments identified statistically significant differences in some areas related to LLM output. Despite these impressive capabilities, LLMs still have limitations in clinical applications.