Benchmarking large language models for cell-free RNA diagnostic biomarker discovery
Abstract
Large language models (LLMs) can parse vast amounts of data and generate executable code, positioning them as promising tools for developing biomarkers and classifiers from high-throughput omics data. Here, we benchmarked six LLMs, OpenAI’s o3 and GPT-4o, Anthropic’s Claude Opus 4 and Claude 3.7 Sonnet, and Google’s Gemini 2.5 Pro and Gemini 2.0 Flash, for disease classification based on plasma cell-free RNA (cfRNA) profiles obtained by RNA sequencing. We analyzed data from cohorts of children with Kawasaki disease (KD) or multisystem inflammatory syndrome in children (MIS-C), adults with active tuberculosis (TB) or other non-TB respiratory conditions, and individuals with myalgic encephalomyelitis/chronic fatigue syndrome (ME/CFS) or a sedentary lifestyle. We assessed two tasks: (i) gene-panel design, where each LLM mined public knowledge to nominate diagnostic genes for use in machine learning (ML), and (ii) end-to-end modeling, where LLMs built an ML workflow directly from raw RNA-seq counts. In the first task, the LLM-derived panels captured canonical immune pathways and outperformed randomly selected genes in all cohorts. They underperformed panels chosen by differential gene expression (DGE) analysis in the KD vs. MIS-C and ME/CFS cohorts but performed comparably or better in the TB cohort. In the second task, o3 produced classifiers for KD vs. MIS-C that performed as well as conventional statistical methods without human intervention; performance for the TB and ME/CFS cohorts was slightly below the conventional approach. These findings delineate the current capabilities and limitations of LLMs in diagnostics and open a path for their future use in biomarker discovery.
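To make the first task concrete, the sketch below shows one minimal form such a gene-panel workflow could take: subset raw counts to a nominated panel, normalize to counts-per-million, and fit a simple nearest-centroid classifier. This is an illustrative toy, not the study's actual pipeline; the gene symbols and counts are placeholders, and the study's ML methods are not specified in this abstract.

```python
import math

# Illustrative panel of placeholder gene symbols (not the study's LLM-derived panels)
PANEL = ["IFI27", "GBP5", "DUSP3"]

def cpm(counts):
    """Normalize a {gene: raw_count} profile to counts-per-million."""
    total = sum(counts.values())
    return {g: 1e6 * c / total for g, c in counts.items()}

def panel_vector(counts, panel=PANEL):
    """Subset a CPM-normalized profile to the panel, log1p-transformed for stability."""
    norm = cpm(counts)
    return [math.log1p(norm.get(g, 0.0)) for g in panel]

def fit_centroids(samples):
    """samples: list of (label, counts). Return the mean panel vector per class."""
    by_label = {}
    for label, counts in samples:
        by_label.setdefault(label, []).append(panel_vector(counts))
    return {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
            for lab, vecs in by_label.items()}

def classify(counts, centroids):
    """Assign the label whose class centroid is nearest in squared Euclidean distance."""
    v = panel_vector(counts)
    return min(centroids,
               key=lambda lab: sum((a - b) ** 2 for a, b in zip(v, centroids[lab])))

# Toy usage with fabricated counts (hypothetical, for illustration only)
train = [
    ("TB",    {"IFI27": 900, "GBP5": 500, "DUSP3": 40,  "ACTB": 5000}),
    ("TB",    {"IFI27": 800, "GBP5": 450, "DUSP3": 50,  "ACTB": 4800}),
    ("other", {"IFI27": 50,  "GBP5": 60,  "DUSP3": 300, "ACTB": 5000}),
    ("other", {"IFI27": 40,  "GBP5": 70,  "DUSP3": 280, "ACTB": 5200}),
]
centroids = fit_centroids(train)
print(classify({"IFI27": 850, "GBP5": 480, "DUSP3": 45, "ACTB": 5000}, centroids))
```

In practice the nominated panel would feed a stronger model (e.g., regularized regression or gradient boosting) with cross-validation, but the panel-subset-then-classify structure is the same.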