An Agentic System for Natural Language Querying of The Cancer Genome Atlas Clinical Data

Rajashekar Korutla
Saeed Amal

Abstract

The Cancer Genome Atlas (TCGA) contains extensive clinical data for thousands of cancer patients, yet accessing this information remains challenging because of complex file structures and heterogeneous data formats. This paper presents an agentic system that enables natural language querying of TCGA clinical data by combining rule-based extraction, statistical analysis, and large language model capabilities. The system uses a two-tiered architecture: a data extraction pipeline that consolidates each patient's information from multiple XML and PDF files into unified text documents, and an agentic query processor that interprets natural language queries, plans data extraction strategies, performs statistical analyses, and synthesizes human-readable responses. The system has two operational modes: a fast path for direct patient lookups that bypasses language model processing, and a comprehensive analytical pipeline for complex cohort-level queries. Evaluation across ten cancer types comprising 4,597 patients shows that the system handles queries ranging from simple patient data retrieval to comparative analyses across demographic and clinical subgroups. The system extracted and analyzed clinical parameters including demographics, staging, treatments, and survival outcomes, while maintaining awareness of data limitations and potential biases. This work gives researchers an accessible interface to TCGA clinical data, potentially accelerating hypothesis generation and exploratory data analysis in cancer research.
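
As a rough illustration of the two operational modes described in the abstract, the Python sketch below routes a query to the fast path when it mentions a TCGA patient barcode and to the analytical pipeline otherwise. The function names, the returned dictionary, and the use of a barcode regular expression as the routing signal are assumptions made for illustration; the abstract does not say how the system detects a direct patient lookup.

```python
import re
from typing import Optional

# TCGA patient barcodes have the form TCGA-<site>-<participant>,
# for example TCGA-A1-A0SB. Using this pattern as the routing signal
# is an assumption for this sketch, not the authors' stated method.
PATIENT_BARCODE = re.compile(r"\bTCGA-[A-Z0-9]{2}-[A-Z0-9]{4}\b", re.IGNORECASE)


def find_patient_id(query: str) -> Optional[str]:
    """Return the first patient barcode in the query (upper-cased), or None."""
    match = PATIENT_BARCODE.search(query)
    return match.group(0).upper() if match else None


def route_query(query: str) -> dict:
    """Pick an operational mode for a query (hypothetical helper).

    "fast": direct patient lookup that skips the language model.
    "analytical": cohort-level pipeline (planning, statistics, synthesis).
    """
    patient_id = find_patient_id(query)
    if patient_id is not None:
        return {"path": "fast", "patient_id": patient_id}
    return {"path": "analytical", "patient_id": None}


if __name__ == "__main__":
    print(route_query("Show the clinical record for TCGA-A1-A0SB"))
    print(route_query("Compare survival by stage in breast cancer"))
```

A real router would likely need more than a pattern match, for instance to handle a query that names a patient but asks for a cohort-level comparison; the sketch only shows the branching idea.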

Version published to 10.21203/rs.3.rs-7557593/v1 on Research Square
Oct 10, 2025

Related articles

Prompt-Orchestrated Large Language Models for Clinical Information Extraction

This article has 13 authors:
1. Livia Lilli
2. Andrea Rosati
3. Giovanni Paolo Tobia
4. Massimo Criscione
5. Federica Tomassini
6. Chiara Dachena
7. Alice Luraschi
8. Chiara Cantarini
9. Carolina De Maria
10. Luigi Congedo
11. Massimo Bernaschi
12. Stefano Patarnello
13. Anna Fagotti
This article has no evaluations. Latest version Jan 16, 2026

LLMAgent4Bio: LLM Agents for Biological Intelligence Across Genomics, Proteomics, Spatial Biology, and Biomedicine

This article has 9 authors:
1. Sajib Acharjee Dip
2. Dipanwita Mallick
3. Uddip Acharjee Shuvo
4. Shovito Barua Soummo
5. Fazle Rafsani
6. Bikash Kumar Paul
7. Nazifa Ahmed Moumi
8. Shafayat Ahmed
9. Liqing Zhang
This article has no evaluations. Latest version Dec 16, 2025

Emergence of Biological Structural Discovery in General-Purpose Language Models

This article has 1 author:
1. Liang Wang
This article has no evaluations. Latest version Jan 8, 2026
