The Open Pediatric Cancer Project

Abstract

Background

In 2019, the Open Pediatric Brain Tumor Atlas (OpenPBTA) was created as a global, collaborative open-science initiative to genomically characterize 1,074 pediatric brain tumors and 22 patient-derived cell lines. Here, we present an extension of the OpenPBTA called the Open Pediatric Cancer (OpenPedCan) Project, a harmonized open-source multi-omic dataset from 6,112 pediatric cancer patients with 7,096 tumor events across more than 100 histologies. Combined with RNA-Seq from the Genotype-Tissue Expression (GTEx) project and The Cancer Genome Atlas (TCGA), OpenPedCan contains nearly 48,000 total biospecimens (24,002 tumor and 23,893 normal specimens).

Findings

We utilized Gabriella Miller Kids First (GMKF) workflows to harmonize WGS, WXS, RNA-Seq, and Targeted Sequencing datasets to include somatic SNVs, InDels, CNVs, SVs, RNA expression, fusions, and splice variants. We integrated summarized CPTAC whole-cell proteomics and phospho-proteomics data and miRNA-Seq data, and developed a methylation array harmonization workflow to include M-values, beta-values, and copy number calls. OpenPedCan contains reproducible, dockerized workflows in GitHub, CAVATICA, and Amazon Web Services (AWS) to deliver harmonized and processed data from over 60 scalable modules, which can be run both locally and on AWS. The processed data are released in a versioned manner, are accessible through CAVATICA or AWS S3 download (from GitHub), and are queryable through PedcBioPortal and the NCI’s pediatric Molecular Targets Platform. Notably, we have expanded PBTA molecular subtyping to include methylation information to align with the WHO 2021 Central Nervous System Tumor classifications, allowing us to create research-grade integrated diagnoses for these tumors.
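The abstract names M-values and beta-values without defining them. As a minimal sketch, assuming the usual convention for methylation arrays (a beta-value is the methylated fraction between 0 and 1, and the M-value is its base-2 logit), the two can be converted as below. The function names and the clamping constant `eps` are illustrative assumptions, not taken from the OpenPedCan workflow.

```python
import math


def beta_to_m(beta: float, eps: float = 1e-6) -> float:
    """Convert a beta-value (methylated fraction, 0..1) to an M-value."""
    # Clamp away from exactly 0 and 1 so the logarithm stays finite.
    # The value of eps is an assumption chosen for illustration.
    clamped = min(max(beta, eps), 1.0 - eps)
    return math.log2(clamped / (1.0 - clamped))


def m_to_beta(m_value: float) -> float:
    """Convert an M-value back to a beta-value (inverse of the base-2 logit)."""
    return 1.0 / (1.0 + 2.0 ** (-m_value))


if __name__ == "__main__":
    for beta in (0.1, 0.5, 0.8):
        m_value = beta_to_m(beta)
        print(f"beta={beta:.2f} -> M={m_value:.3f} -> beta={m_to_beta(m_value):.2f}")
```

A beta-value of 0.5 maps to an M-value of 0, and values near 0 or 1 map to large negative or positive M-values.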

Conclusions

OpenPedCan data and its reproducible analysis module framework are openly available and can be utilized and/or adapted by researchers to accelerate discovery, validation, and clinical translation.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf093), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Jacek Majewski

    Shapiro et al. describe the Open Pediatric Cancer Project, a dataset, web portals, and a GitHub repository to facilitate data access, analysis, and encourage collaborations using pediatric cancer omics data. While the concept is inspired, it does not constitute a significant advance over the previously described OpenPBTA project. The goal of the manuscript may be to provide a pointer to the updated datasets and web resources, but this does not seem like a sufficient reason to publish. As far as I can tell, all of the information in the manuscript is already provided on the OpenPedCan Bioportal (which is really useful, to be fair) and on GitHub. To publish a manuscript just as a pointer to that information does not seem justifiable in my opinion.

    Major Concerns:

    1. Novelty and Validity of Key Features:

    The manuscript highlights several key features of OpenPedCan, including data harmonization, multi-omic integration, reproducibility, scalability, versioned data releases, accessibility, alignment with WHO 2021 classifications, and the open-source framework. However, these features are not novel. Many of them represent standard practices in the field. Moreover, some claims appear questionable:

    • Reproducibility: While the authors claim reproducibility, using OpenPedCan's dockerized workflows would require significant computational resources (e.g., 98GB of CPU) or expensive cloud services (e.g., AWS).
    • Accessibility: The platform's interface requires users to have a Gmail account, limiting its accessibility. Alternative login options should be considered.
    • Open-Source Framework: The manuscript does not adequately address how the framework handles access to controlled data, such as those integrated from external sources like TARGET and TCGA, which may require restricted access permissions.
    2. Lack of Novel Methodologies and Findings:
    • While OpenPedCan integrates data from existing workflows and portals (e.g., Gabriella Miller Kids First, TCGA), the manuscript does not clearly outline novel methodologies or scientific contributions. Most prominently, the submission appears to be an incremental extension of the previous manuscript describing OpenPBTA published in Cell Genomics 2023. The only potentially novel components appear to be proteomics and molecular subtyping based on methylation, but no specific examples or case studies demonstrating the novelty or impact of these contributions are provided.
    3. Redundancy with Existing Tools:
    • The manuscript states that OpenPedCan serves as a community resource for addressing research questions and providing orthogonal validation datasets. However, there is nothing presented in OpenPedCan that cannot already be achieved with existing tools. This makes the claim somewhat redundant, as the platform largely serves as a data integrator rather than offering unique capabilities.

    Minor Concerns:

    1. Splicing Analysis Module:
    • The manuscript refers to a splicing analysis module (Figure 2: OpenPedCan Analysis Workflow), but there is no further description or discussion of this module within the text. Further elaboration is needed.
    2. Incomplete Module Descriptions:
    • The manuscript describes several analysis modules, but it should provide more comprehensive descriptions of the analysis modules, especially the Splicing Analysis module.
    • Additionally, the Molecular Subtyping component, based on molecular and methylation data, is the only module with a clear methodological explanation.
    • Further clarification on the methods used in other modules would be beneficial.
  2. Reviewer 1: Stephen R Piccolo

    I love this type of work. This research will be invaluable to the wider research community of people studying pediatric cancers. It will save lots of time and frustration and move the field forward. The paper is well written. I have to admit that I am not well versed in all of the latest software tools and settings to use for processing all of the data types that the repository includes. So I cannot vouch for or against those. However, the tools that I am familiar with seem reasonable. I have a few comments / suggestions / questions.

    • How is patient privacy maintained? Sorry if I missed this. The paper mentions the original sources of the data. However, if I understand correctly, OpenPBTA has reprocessed versions of the data. What processes are used to regulate access to versions of the data that must be kept secure? Perhaps I am misunderstanding the ideas behind how this works.
    • Validation. It would be helpful if the paper could touch on the approach the authors use to ensure that data that they have (re)processed are valid. For example, are there any known findings that show up after the data have been reprocessed? Or are there other ways of assessing quality?
    • The paper mentions TCGA and GTEx. It also mentions that adult data are included. But I didn't see a clear rationale for doing this.
    • The paper includes many links, some of which reference portions of the GitHub site. It would be best to display the URLs in the paper itself. It would also be useful to reference a Zenodo-archived version of the GitHub site so that there is a versioned record of the repository at the time of submission.
    • Supplementary Table 1 has a tab with information about the patient metadata ("Biospecimen-level metadata and clinical data"). However, I didn't see details in the paper about how these were harmonized. How did the authors ensure that metadata values that come from disparate sources were used consistently? What expertise did they have? How did they resolve inconsistencies or missing data? Supplementary Table 1 indicates a definition and a data type for each of these fields. It would be much more useful to provide ontology term(s) for each of these fields so that the metadata are machine readable.
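
To make the last suggestion concrete, here is a minimal sketch of what a machine-readable data dictionary could look like: each field carries a data type and an ontology term, and records are checked against it. Everything in it is hypothetical. The field names, the `EXAMPLE:` term identifiers, and the `validate_record` helper are illustrative assumptions and do not come from Supplementary Table 1 or the OpenPedCan repository.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class FieldSpec:
    """One row of a hypothetical data dictionary."""
    name: str
    dtype: type
    ontology_term: Optional[str] = None  # placeholder identifier, not a real term


# Hypothetical field names and placeholder ontology identifiers.
DATA_DICTIONARY: List[FieldSpec] = [
    FieldSpec("age_at_diagnosis_days", int, "EXAMPLE:0000001"),
    FieldSpec("cancer_group", str, "EXAMPLE:0000002"),
]


def validate_record(record: Dict[str, Any], specs: List[FieldSpec]) -> List[str]:
    """Return a list of problems; an empty list means the record conforms."""
    problems: List[str] = []
    for spec in specs:
        value = record.get(spec.name)
        if value is None:
            problems.append(f"{spec.name}: missing")
        elif not isinstance(value, spec.dtype):
            problems.append(
                f"{spec.name}: expected {spec.dtype.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


if __name__ == "__main__":
    record = {"age_at_diagnosis_days": "1200", "cancer_group": "Example group"}
    for problem in validate_record(record, DATA_DICTIONARY):
        print(problem)
```

A check of this kind flags values that arrive with inconsistent types or that are missing, which is one way the harmonization questions raised above could be made auditable.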