An Agentic System for Natural Language Querying of The Cancer Genome Atlas Clinical Data

Abstract

The Cancer Genome Atlas (TCGA) contains extensive clinical data for thousands of cancer patients, yet accessing this information remains challenging because of complex file structures and heterogeneous data formats. This paper presents an agentic system that enables natural language querying of TCGA clinical data by combining rule-based extraction, statistical analysis, and large language model capabilities. The system uses a two-tier architecture: a data extraction pipeline that consolidates each patient's information from multiple XML and PDF files into a unified text document, and an agentic query processor that interprets natural language queries, plans data extraction strategies, performs statistical analyses, and synthesizes human-readable responses. The system has two distinct operational modes: a fast path for direct patient lookups that bypasses language model processing, and a comprehensive analytical pipeline for complex cohort-level queries. Evaluation across ten cancer types, comprising 4597 patients in total, demonstrates the system's ability to handle queries ranging from simple patient data retrieval to comparative analyses across demographic and clinical subgroups. The system extracted and analyzed clinical parameters including demographics, staging, treatments, and survival outcomes, while reporting data limitations and potential biases alongside its answers. This work provides researchers with an accessible interface to TCGA clinical data, potentially accelerating hypothesis generation and exploratory data analysis in cancer research.
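
The two operational modes described above can be illustrated with a minimal routing sketch. The abstract does not say how the system decides which mode to use, so the rule here is an assumption: a query that mentions a TCGA patient barcode (for example `TCGA-A1-A0SB`) is treated as a direct lookup, and anything else goes to the analytical pipeline. The names `Route` and `route_query` are hypothetical and are not taken from the paper.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Assumption: a direct patient lookup is recognised by a TCGA patient
# barcode of the form TCGA-XX-XXXX appearing in the query text.
PATIENT_BARCODE = re.compile(r"\bTCGA-[A-Z0-9]{2}-[A-Z0-9]{4}\b", re.IGNORECASE)


@dataclass
class Route:
    """Hypothetical routing decision for a single query."""
    mode: str                        # "fast_path" or "analytical"
    patient_id: Optional[str] = None  # set only for fast-path lookups


def route_query(query: str) -> Route:
    """Choose between the fast path and the analytical pipeline."""
    match = PATIENT_BARCODE.search(query)
    if match:
        # Direct lookup: no language model call is needed on this path.
        return Route(mode="fast_path", patient_id=match.group(0).upper())
    # Cohort-level question: hand over to the full agentic pipeline.
    return Route(mode="analytical")


if __name__ == "__main__":
    print(route_query("What stage was patient tcga-a1-a0sb diagnosed at?"))
    print(route_query("Compare survival by stage across breast cancer patients"))
```

In a sketch like this, the fast path stays deterministic and cheap, while only cohort-level questions incur the cost of language model planning and statistical analysis.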
