Benchmarking large language models for cell-free RNA diagnostic biomarker discovery


Abstract

Large language models (LLMs) can parse vast amounts of data and generate executable code, positioning them as promising tools for developing biomarkers and classifiers from high-throughput omics data. Here, we benchmarked six LLMs (OpenAI’s o3 and GPT-4o, Anthropic’s Claude Opus 4 and Claude 3.7 Sonnet, and Google’s Gemini 2.5 Pro and Gemini 2.0 Flash) for disease classification based on plasma cell-free RNA (cfRNA) profiles obtained by RNA sequencing. We analyzed data from cohorts of children with Kawasaki disease (KD) or multisystem inflammatory syndrome in children (MIS-C), adults with active tuberculosis (TB) or other non-TB respiratory conditions, and individuals with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) or a sedentary lifestyle. We assessed two tasks: (i) gene-panel design, in which each LLM mined public knowledge to nominate diagnostic genes for use in machine learning (ML), and (ii) end-to-end modeling, in which LLMs built an ML workflow directly from raw RNA-seq counts. In the first task, the LLM-derived panels captured canonical immune pathways and outperformed randomly selected genes in all cohorts. They underperformed panels chosen by differential gene expression (DGE) analysis in the KD vs. MIS-C and ME/CFS cohorts but performed comparably or better in the TB cohort. In the second task, o3 produced classifiers for KD vs. MIS-C that matched conventional statistical methods without human intervention; performance in the TB and ME/CFS cohorts was slightly below the conventional approach. These findings delineate the current capabilities and limitations of LLMs in diagnostics and open a path for their future use in biomarker discovery.
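The end-to-end modeling task described above (building an ML classifier directly from raw RNA-seq counts) can be illustrated with a minimal sketch. This is not the authors' pipeline; it is a generic, hypothetical workflow on synthetic count data, using a log transform, standardization, and penalized logistic regression, evaluated by held-out AUROC:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for a cfRNA count matrix: 100 samples x 500 genes,
# with the first 10 genes shifted upward in the "disease" group.
n_samples, n_genes = 100, 500
counts = rng.negative_binomial(5, 0.3, size=(n_samples, n_genes))
labels = np.repeat([0, 1], n_samples // 2)
counts[labels == 1, :10] += rng.poisson(15, size=(n_samples // 2, 10))

X_train, X_test, y_train, y_test = train_test_split(
    counts, labels, test_size=0.3, stratify=labels, random_state=0
)

# log1p-transform counts, standardize genes, then fit an L2-penalized
# logistic regression as the classifier.
model = make_pipeline(
    FunctionTransformer(np.log1p),
    StandardScaler(),
    LogisticRegression(penalty="l2", C=1.0, max_iter=1000),
)
model.fit(X_train, y_train)
auroc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"held-out AUROC: {auroc:.2f}")
```

In the study, the LLMs were asked to generate workflows of this general shape themselves; the specific transforms, model family, and validation scheme are assumptions for illustration only.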
