Benchmarking large language models for cell-free RNA diagnostic biomarker discovery
Abstract
Large language models (LLMs) can parse vast amounts of data and generate executable code, positioning them as promising tools for developing biomarkers and classifiers from high-throughput omics data. Here, we benchmarked six LLMs, OpenAI’s o3 and GPT-4o, Anthropic’s Claude Opus 4 and Claude 3.7 Sonnet, and Google’s Gemini 2.5 Pro and Gemini 2.0 Flash, for disease classification based on plasma cell-free RNA (cfRNA) profiles obtained by RNA sequencing. We analyzed data from cohorts of children with Kawasaki disease (KD) or multisystem inflammatory syndrome in children (MIS-C), adults with active tuberculosis (TB) or other non-TB respiratory conditions, and individuals with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) or a sedentary lifestyle. We assessed two tasks: (i) gene-panel design, where each LLM mined public knowledge to nominate diagnostic genes for use in machine learning (ML), and (ii) end-to-end modeling, where LLMs built an ML workflow directly from raw RNA-seq counts. In the first task, the LLM-derived panels captured canonical immune pathways and outperformed randomly selected genes in all cohorts. They underperformed panels chosen by differential gene expression (DGE) analysis in the KD vs. MIS-C and ME/CFS cohorts but performed comparably or better in the TB cohort. In the second task, o3 produced classifiers for KD vs. MIS-C that performed as well as conventional statistical methods without human intervention; performance for the TB and ME/CFS cohorts was slightly below the conventional approach. These findings delineate the current capabilities and limitations of LLMs in diagnostics and open a path for their future use in biomarker discovery.
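To make the first task concrete, the sketch below shows one minimal form such a gene-panel workflow could take: subset raw counts to a nominated panel, normalize to counts-per-million, and fit a simple nearest-centroid classifier. This is an illustrative toy, not the study's actual pipeline; the gene symbols and counts are placeholders, and the study's ML methods are not specified in this abstract.

```python
import math

# Illustrative panel of placeholder gene symbols (not the study's LLM-derived panels)
PANEL = ["IFI27", "GBP5", "DUSP3"]

def cpm(counts):
    """Normalize a {gene: raw_count} profile to counts-per-million."""
    total = sum(counts.values())
    return {g: 1e6 * c / total for g, c in counts.items()}

def panel_vector(counts, panel=PANEL):
    """Subset a CPM-normalized profile to the panel, log1p-transformed for stability."""
    norm = cpm(counts)
    return [math.log1p(norm.get(g, 0.0)) for g in panel]

def fit_centroids(samples):
    """samples: list of (label, counts). Return the mean panel vector per class."""
    by_label = {}
    for label, counts in samples:
        by_label.setdefault(label, []).append(panel_vector(counts))
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in by_label.items()}

def classify(counts, centroids):
    """Assign the label whose class centroid is nearest in squared Euclidean distance."""
    v = panel_vector(counts)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(v, centroids[lab])))

# Toy usage with fabricated counts (hypothetical, for illustration only)
train = [
    ("TB",    {"IFI27": 900, "GBP5": 500, "DUSP3": 40,  "ACTB": 5000}),
    ("TB",    {"IFI27": 800, "GBP5": 450, "DUSP3": 50,  "ACTB": 4800}),
    ("other", {"IFI27": 50,  "GBP5": 60,  "DUSP3": 300, "ACTB": 5000}),
    ("other", {"IFI27": 40,  "GBP5": 70,  "DUSP3": 280, "ACTB": 5200}),
]
centroids = fit_centroids(train)
print(classify({"IFI27": 850, "GBP5": 480, "DUSP3": 45, "ACTB": 5000}, centroids))
```

In practice the nominated panel would feed a stronger model (e.g., regularized regression or gradient boosting) with cross-validation, but the panel-subset-then-classify structure is the same.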