Analysis of clinical, single cell, and spatial data from the Human Tumor Atlas Network (HTAN) with massively distributed cloud-based queries

Abstract

Cancer research increasingly relies on large-scale, multimodal datasets that capture the complexity of tumor ecosystems across diverse patients, cancer types, and disease stages. The Human Tumor Atlas Network (HTAN) generates such data, including single-cell transcriptomics, proteomics, and multiplexed imaging. However, the volume and heterogeneity of these data present challenges for researchers seeking to integrate, explore, and analyze them at scale. To address these challenges, HTAN developed a cloud-based infrastructure that transforms clinical and assay metadata into aggregate Google BigQuery tables, hosted through the Institute for Systems Biology Cancer Gateway in the Cloud (ISB-CGC). This infrastructure introduces two key innovations: (1) a provenance-based HTAN ID table that simplifies cohort construction and cross-assay integration, and (2) a novel adaptation of BigQuery’s geospatial functions for use in spatial biology, enabling neighborhood and correlation analysis of tumor microenvironments. We demonstrate these capabilities through R and Python notebooks that highlight use cases such as identifying precancer and organ-specific sample cohorts, integrating multimodal datasets, and analyzing single-cell and spatial data. By lowering technical and computational barriers, this infrastructure provides a cost-effective and intuitive entry point for researchers and highlights the potential of cloud-based platforms to accelerate cancer discoveries.
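
To make the idea of neighborhood analysis concrete, the following is a minimal local sketch in Python of the kind of question a distance-based spatial query answers: for each cell, which other cells lie within a given radius? It is not the HTAN implementation and does not call BigQuery; the toy coordinates, cell labels, and function names are hypothetical and chosen only for illustration. In BigQuery, a comparable result would presumably come from a self-join filtered with a geospatial predicate such as ST_DWITHIN, but the exact queries and coordinate handling used by HTAN are described in the full article and its notebooks, not here.

```python
from math import hypot

# Hypothetical toy data, NOT taken from HTAN: (cell_id, x, y, cell_type),
# with x and y in arbitrary planar units (e.g. microns).
CELLS = [
    ("c1", 0.0, 0.0, "tumor"),
    ("c2", 3.0, 4.0, "immune"),
    ("c3", 10.0, 0.0, "immune"),
    ("c4", 0.0, 6.0, "tumor"),
]


def neighbors_within(cells, radius):
    """Map each cell id to the sorted ids of other cells within `radius`.

    Brute-force O(n^2) comparison using Euclidean distance; a cloud data
    warehouse would distribute the equivalent join across many workers.
    """
    result = {}
    for cid, x, y, _ in cells:
        near = [
            oid
            for oid, ox, oy, _ in cells
            if oid != cid and hypot(ox - x, oy - y) <= radius
        ]
        result[cid] = sorted(near)
    return result


def neighbor_type_counts(cells, radius):
    """Map each cell id to a {cell_type: count} summary of its neighbors."""
    types = {cid: ctype for cid, _, _, ctype in cells}
    counts = {}
    for cid, near in neighbors_within(cells, radius).items():
        summary = {}
        for oid in near:
            summary[types[oid]] = summary.get(types[oid], 0) + 1
        counts[cid] = summary
    return counts


if __name__ == "__main__":
    for cid, summary in neighbor_type_counts(CELLS, radius=6.0).items():
        print(cid, summary)
```

The per-cell summaries of neighboring cell types are the kind of intermediate result that neighborhood and correlation analyses of a tumor microenvironment can build on; the brute-force loop is used only to keep the sketch short.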
