How Can I Stay Healthy? – Benchmarking Large Language Models for Personalized and Biomarker-Based Intervention Recommendations

Abstract

Background
The integration of large language models (LLMs) into clinical workflows for diagnostics and intervention recommendations has gained interest due to their strong performance on various medical benchmarks. However, we lack benchmarks that assess their applicability for personalized interventions, specifically in geroscience and longevity medicine.

Methods
We extended the BioChatter framework for developing biomedical LLM benchmarks, with the primary aim of assessing whether LLMs can generate personalized intervention recommendations based on biomarker profiles while complying with predefined validation requirements. We created 25 medically relevant personal profiles across three age groups, in which people seek advice on interventions such as caloric restriction, intermittent fasting, exercise, and selected supplements and drugs. We then used these profiles to construct 1,000 test cases in a combinatorial fashion, simulating real-world variability in user prompts. We evaluated multiple proprietary and open-source models using an LLM-as-a-judge approach, assessing 48,000 primary responses against expert-validated ground truths (a sketch of this pipeline follows the abstract).

Results
Proprietary models outperformed open-source ones, particularly with respect to comprehensiveness. While LLMs largely succeed in providing explainable suggestions, their limited comprehensiveness may hinder informed decision-making. LLMs respond positively to more concrete instructions in the system prompt but remain vulnerable to overall prompt variability. Responses account well for the safety of interventions, potentially at the cost of lower utility. Moreover, LLM performance is heterogeneous across age groups, displaying age-related biases, though these may reflect differences in disease prevalence.

Conclusion
Our findings indicate that LLMs are not generally suitable for unsupervised preventive intervention recommendations, given their inconsistent performance across key validation requirements, although proprietary models mostly perform well when evaluated by automated judgments assisted by expert commentaries. Our open-source benchmarking and evaluation framework provides a blueprint for advancing LLM evaluation in other medical contexts, enabling better AI-driven healthcare applications.
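To make the combinatorial test-case construction and LLM-as-a-judge evaluation described in the Methods more concrete, the Python sketch below shows one way such a pipeline could be assembled. The profile fields, prompt templates, interventions, and judge interface are illustrative assumptions and do not reproduce the actual BioChatter implementation; only the evaluation dimensions (comprehensiveness, explainability, safety, utility) are taken from the abstract.

```python
from dataclasses import dataclass
from itertools import product

# Illustrative personal profile; the study uses 25 medically relevant
# profiles spanning three age groups (fields here are assumptions).
@dataclass
class Profile:
    age_group: str      # e.g. "young adult", "middle-aged", "older adult"
    biomarkers: dict    # e.g. {"HbA1c": 5.9, "LDL": 140}
    conditions: tuple   # pre-existing conditions relevant to safety checks

# Hypothetical prompt templates simulating real-world user phrasing variability.
PROMPT_TEMPLATES = [
    "Given my biomarkers {biomarkers}, should I try {intervention}?",
    "Is {intervention} safe and useful for someone like me ({age_group}, {biomarkers})?",
]

# Interventions named in the abstract (subset, for illustration).
INTERVENTIONS = ["caloric restriction", "intermittent fasting", "exercise"]

def build_test_cases(profiles):
    """Combine profiles, interventions, and prompt templates combinatorially."""
    cases = []
    for profile, intervention, template in product(profiles, INTERVENTIONS, PROMPT_TEMPLATES):
        cases.append({
            "profile": profile,
            "intervention": intervention,
            "prompt": template.format(
                biomarkers=profile.biomarkers,
                intervention=intervention,
                age_group=profile.age_group,
            ),
        })
    return cases

# Evaluation dimensions drawn from the abstract.
JUDGE_RUBRIC = ["comprehensiveness", "explainability", "safety", "utility"]

def judge_response(judge_llm, response, ground_truth):
    """Ask a judge model to score a response against the expert-validated
    ground truth (sketch; judge_llm is any callable returning scores)."""
    judge_prompt = (
        "Score the following answer against the expert ground truth on: "
        + ", ".join(JUDGE_RUBRIC)
        + f"\n\nAnswer:\n{response}\n\nGround truth:\n{ground_truth}"
    )
    return judge_llm(judge_prompt)  # assumed to return a dict of criterion -> score
```

In this sketch, each profile is crossed with intervention and phrasing variants to expand a small set of expert-curated profiles into a much larger test set, and every model response is scored by a separate judge model against the ground truth on the rubric dimensions.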
