Analysis of clinical, single cell, and spatial data from the Human Tumor Atlas Network (HTAN) with massively distributed cloud-based queries
Abstract
Cancer research increasingly relies on large-scale, multimodal datasets that capture the complexity of tumor ecosystems across diverse patients, cancer types, and disease stages. The Human Tumor Atlas Network (HTAN) generates such data, including single-cell transcriptomics, proteomics, and multiplexed imaging. However, the volume and heterogeneity of the data present challenges for researchers seeking to integrate, explore, and analyze these datasets at scale. To address this, HTAN developed a cloud-based infrastructure that transforms clinical and assay metadata into aggregate Google BigQuery tables, hosted through the Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC). This infrastructure introduces two key innovations: (1) a provenance-based HTAN ID table that simplifies cohort construction and cross-assay integration, and (2) the novel adaptation of BigQuery’s geospatial functions for use in spatial biology, enabling neighborhood and correlation analysis of tumor microenvironments. We demonstrate these capabilities through R and Python notebooks that highlight use cases such as identifying precancer and organ-specific sample cohorts, integrating multimodal datasets, and analyzing single-cell and spatial data. By lowering technical and computational barriers, this infrastructure provides a cost-effective and intuitive entry point for researchers, highlighting the potential of cloud-based platforms to accelerate cancer discoveries.
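To make the second innovation concrete, the following is a minimal sketch of how a neighborhood query over cell coordinates might be phrased with BigQuery's geospatial functions. It only builds the SQL text and does not contact BigQuery. `ST_GEOGPOINT` and `ST_DWITHIN` are real BigQuery functions, but everything else here is an assumption for illustration and is not taken from the HTAN notebooks: the table name, the `cell_id`, `x`, and `y` columns, the helper name `build_neighbor_count_query`, and in particular the `deg_per_unit` scaling that maps image coordinates onto small longitude/latitude offsets.

```python
def build_neighbor_count_query(
    table: str, radius_m: float, deg_per_unit: float = 1e-5
) -> str:
    """Build BigQuery SQL counting, for each cell, the other cells within radius_m.

    Assumptions (hypothetical, not HTAN's schema): `table` has columns
    cell_id, x, y, where x and y are image coordinates. They are scaled by
    deg_per_unit so the resulting points stay close to (0, 0) on the globe,
    where distances in metres are nearly proportional to image distances.
    """
    if radius_m <= 0:
        raise ValueError("radius_m must be positive")
    return f"""
WITH cells AS (
  SELECT
    cell_id,
    -- ST_GEOGPOINT takes (longitude, latitude) in degrees
    ST_GEOGPOINT(x * {deg_per_unit}, y * {deg_per_unit}) AS pt
  FROM `{table}`
)
SELECT
  a.cell_id,
  COUNT(*) AS n_neighbors
FROM cells AS a
JOIN cells AS b
  -- ST_DWITHIN is true when the two points are within radius_m metres
  ON ST_DWITHIN(a.pt, b.pt, {radius_m})
WHERE a.cell_id != b.cell_id
GROUP BY a.cell_id
""".strip()


if __name__ == "__main__":
    # Placeholder table name; replace with a real project.dataset.table.
    print(build_neighbor_count_query("my-project.my_dataset.cells", 30))
```

Because this uses an inner join, cells with no neighbors inside the radius do not appear in the result. The radius is expressed in metres on the globe, so its correspondence to image units depends entirely on the chosen `deg_per_unit`; with the default of 1e-5 degrees per unit, one image unit is roughly one metre near the equator.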