The Open Pediatric Cancer Project

Zhuangzhuang Geng
Eric Wafula
Ryan J Corbett
Yuanchao Zhang
Run Jin
Krutika S Gaonkar
Sangeeta Shukla
Komal S Rathi
Dave Hill
Aditya Lahiri
Daniel P Miller
Alex Sickler
Kelsey Keith
Christopher Blackden
Antonia Chroni
Miguel A Brown
Adam A Kraya
Kaylyn L Clark
Brian R Rood
Adam C Resnick
Nicholas Van Kuren
John M Maris
Alvin Farrel
Mateusz P Koptyra
Gerri R Trooskin
Noel Coleman
Yuankun Zhu
Stephanie Stefankiewicz
Zied Abdullaev
Asif T Chinwalla
Mariarita Santi
Ammar S Naqvi
Jennifer L Mason
Carl J Koschmann
Xiaoyan Huang
Sharon J Diskin
Kenneth Aldape
Bailey K Farrow
Weiping Ma
Bo Zhang
Brian M Ennis
Sarah Tasian
Saksham Phul
Matthew R Lueder
Chuwei Zhong
Joseph M Dybas
Pei Wang
Deanne Taylor
Jo Lynne Rokita

Listed in

Evaluated articles (GigaScience)

Abstract

Background

In 2019, the Open Pediatric Brain Tumor Atlas (OpenPBTA) was created as a global, collaborative open-science initiative to genomically characterize 1,074 pediatric brain tumors and 22 patient-derived cell lines. Here, we present an extension of the OpenPBTA called the Open Pediatric Cancer (OpenPedCan) Project, a harmonized open-source multiomic dataset from 6,112 pediatric cancer patients with 7,096 tumor events across more than 100 histologies. Combined with RNA sequencing (RNA-seq) from the Genotype-Tissue Expression and The Cancer Genome Atlas projects, OpenPedCan contains nearly 48,000 total biospecimens (24,002 tumor and 23,893 normal specimens).

Findings

We utilized Gabriella Miller Kids First workflows to harmonize whole-genome sequencing (WGS), whole-exome sequencing (WXS), RNA-seq, and targeted sequencing datasets to include somatic SNVs, indels, copy number variants, structural variants, RNA expression, fusions, and splice variants. We integrated summarized Clinical Proteomic Tumor Analysis Consortium whole-cell proteomics and phospho-proteomics data and miRNA sequencing data, and developed a methylation array harmonization workflow to include m-values, beta-values, and copy number calls. OpenPedCan contains reproducible, dockerized workflows in GitHub, CAVATICA, and Amazon Web Services (AWS) to deliver harmonized and processed data from over 60 scalable modules, which can be leveraged both locally and on AWS. The processed data are released in a versioned manner, are accessible through CAVATICA or AWS S3 download (from GitHub), and are queryable through PedcBioPortal and the National Cancer Institute’s pediatric Molecular Targets Platform. Notably, we have expanded Pediatric Brain Tumor Atlas molecular subtyping to include methylation information to align with the World Health Organization 2021 Central Nervous System Tumor classifications, allowing us to create research-grade integrated diagnoses for these tumors.

Conclusions

OpenPedCan data and its reproducible analysis module framework are openly available and can be utilized and/or adapted by researchers to accelerate discovery, validation, and clinical translation.

GigaScience
Sep 30, 2025

This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf093), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer 2: Jacek Majewski

Shapiro et al. describe the Open Pediatric Cancer Project, a dataset, web portals, and a GitHub repository to facilitate data access, analysis, and encourage collaborations using pediatric cancer omics data. While the concept is inspired, it does not constitute a significant advance over the previously described OpenPBTA project. The goal of the manuscript may be to provide a pointer to the updated datasets and web resources, but this does not seem like a sufficient reason to publish. As far as I can tell, all of the information in the manuscript is already provided on the OpenPedCan Bioportal (which is really useful, to be fair) and on GitHub. To publish a manuscript just as a pointer to that information does not seem justifiable in my opinion.

Major Concerns:

Novelty and Validity of Key Features:

The manuscript highlights several key features of OpenPedCan, including data harmonization, multi-omic integration, reproducibility, scalability, versioned data releases, accessibility, alignment with WHO 2021 classifications, and the open-source framework. However, these features are not novel. Many of them represent standard practices in the field. Moreover, some claims appear questionable:

Reproducibility: While the authors claim reproducibility, using OpenPedCan's dockerized workflows would require significant computational resources (e.g., 98GB of CPU) or expensive cloud services (e.g., AWS).

Accessibility: The platform's interface requires users to have a Gmail account, limiting its accessibility. Alternative login options should be considered.

Open-Source Framework: The manuscript does not adequately address how the framework handles access to controlled data, such as those integrated from external sources like TARGET and TCGA, which may require restricted access permissions.

Lack of Novel Methodologies and Findings:

While OpenPedCan integrates data from existing workflows and portals (e.g., Gabriella Miller Kids First, TCGA), the manuscript does not clearly outline novel methodologies or scientific contributions. Most prominently, the submission appears to be an incremental extension of the previous manuscript describing OpenPBTA published in Cell Genomics 2023. The only potentially novel components appear to be proteomics and molecular subtyping based on methylation, but no specific examples or case studies demonstrating the novelty or impact of these contributions are provided.

Redundancy with Existing Tools:

The manuscript states that OpenPedCan serves as a community resource for addressing research questions and providing orthogonal validation datasets. However, there is nothing presented in OpenPedCan that cannot already be achieved with existing tools. This makes the claim somewhat redundant, as the platform largely serves as a data integrator rather than offering unique capabilities.

Minor Concerns:

Splicing Analysis Module:

The manuscript refers to a splicing analysis module (Figure 2: OpenPedCan Analysis Workflow), but there is no further description or discussion of this module within the text. Further elaboration is needed.

Incomplete Module Descriptions:

The manuscript describes several analysis modules, but it should provide more comprehensive descriptions of the analysis modules, especially the Splicing Analysis module.

Additionally, the Molecular Subtyping component, based on molecular and methylation data, is the only module with a clear methodological explanation.

Further clarification on the methods used in other modules would be beneficial.

Reviewer 1: Stephen R Piccolo

I love this type of work. This research will be invaluable to the wider research community of people studying pediatric cancers. It will save lots of time and frustration and move the field forward. The paper is well written. I have to admit that I am not well versed in all of the latest software tools and settings to use for processing all of the data types that the repository includes. So I cannot vouch for or against those. However, the tools that I am familiar with seem reasonable. I have a few comments / suggestions / questions.

How is patient privacy maintained? Sorry if I missed this. The paper mentions the original sources of the data. However, if I understand correctly, OpenPBTA has reprocessed versions of the data. What processes are used to regulate access to versions of the data that must be kept secure? Perhaps I am misunderstanding the ideas behind how this works.

Validation. It would be helpful if the paper could touch on the approach the authors use to ensure that data that they have (re)processed are valid. For example, are there any known findings that show up after the data have been reprocessed? Or are there other ways of assessing quality?

The paper mentions TCGA and GTEx. It also mentions that adult data are included. But I didn't see a clear rationale for doing this.

The paper includes many links, some of which reference portions of the GitHub site. It would be best to display the URLs in the paper itself. It would also be useful to reference a Zenodo-archived version of the GitHub site so that there is a versioned record of the repository at the time of submission.

Supplementary Table 1 has a tab with information about the patient metadata ("Biospecimen-level metadata and clinical data"). However, I didn't see details in the paper about how these were harmonized. How did the authors ensure that the metadata values coming from disparate sources were used consistently? What expertise did they have? How did they resolve inconsistencies or missing data? Supplementary Table 1 indicates a definition and a data type for each of these fields. It would be much more useful to provide ontology term(s) for each of these fields so that the metadata were machine readable.
Version published to 10.1093/gigascience/giaf093
Jan 1, 2025
Version published to 10.1101/2024.07.09.599086 on bioRxiv
Jul 11, 2024
