CNSistent integration and feature extraction from somatic copy number profiles


Abstract

The vast majority of cancers exhibit Somatic Copy Number Alterations (SCNAs)—gains and losses of variable regions of DNA. SCNAs can shape the phenotype of cancer cells, e.g. by increasing their proliferation rates, removing tumor suppressor genes, or immortalizing cells. While many SCNAs are unique to a patient, certain recurring patterns emerge as a result of shared selectional constraints or common mutational processes. To discover such patterns in a robust way, the size of the dataset is essential, which necessitates combining SCNA profiles from different cohorts, a non-trivial task.

To achieve this, we developed CNSistent, a Python package for imputation, filtering, consistent segmentation, feature extraction, and visualization of cancer copy number profiles from heterogeneous datasets. We demonstrate the utility of CNSistent by applying it to the publicly available TCGA, PCAWG, and TRACERx cohorts. We compare different segmentation and aggregation strategies on cancer type and subtype classification tasks using deep convolutional neural networks. We demonstrate an increase in accuracy over training on individual cohorts and efficient transfer learning between cohorts. Using integrated gradients we investigate lung cancer classification results, highlighting SOX2 amplifications as the dominant copy number alteration in lung squamous cell carcinoma.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 3: Sampsa Hautaniemi

    Streck and Schwarz present a method, CNSistent, for consistent segmentation of copy-number data. The utility of the tool is demonstrated using three large cancer cohorts and a neural network classifier built upon the consistently segmented data. CNSistent can facilitate solving an important biomedical problem: the advanced analysis of copy-number data. The authors are lauded for their excellent Python code and thorough documentation. While the contribution is timely and likely important, there are several areas for improvement.

    The manuscript's readability could be better. There are typos, textual errors, and inconsistencies in figure captions, such as incorrect figure references or mismatched values between the text and figures. The "Consistent Segmentation" section is difficult to follow. It is unclear whether this step involves merging pre-existing breakpoints in the data to produce new, longer segments or if larger segments, such as whole chromosomes, are split into smaller, constant-sized segments. The writing suggests that segments are first merged and then split; however, later in the manuscript, they appear to be used separately. In our testing, combining these approaches did not yield meaningful results. Since consistent segmentation is the method's most critical step, we strongly suggest clarifying this section.
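
    To make the distinction the reviewer raises concrete, here is a minimal sketch of the two strategies being contrasted, written in plain Python with hypothetical inputs; this is an illustration, not CNSistent's actual API:

    ```python
    # Two ways to derive a common segmentation across samples.
    # Breakpoints are per-chromosome positions; inputs are hypothetical.

    def merge_breakpoints(per_sample_breakpoints, min_gap=100_000):
        """Union all samples' breakpoints, then collapse any that lie
        within `min_gap` of each other into one consensus breakpoint."""
        merged = []
        for bp in sorted(set.union(*map(set, per_sample_breakpoints))):
            if not merged or bp - merged[-1] >= min_gap:
                merged.append(bp)
        return merged

    def split_fixed(chrom_length, bin_size=1_000_000):
        """Ignore sample breakpoints entirely and tile the chromosome
        with constant-sized bins (e.g. 1 Mb)."""
        return list(range(bin_size, chrom_length, bin_size))

    # The manuscript's wording ("merged and then split") would correspond
    # to applying split_fixed within the long segments that remain after
    # merge_breakpoints.
    samples = [[10_050_000, 45_000_000], [10_000_000, 80_000_000]]
    print(merge_breakpoints(samples))  # [10000000, 45000000, 80000000]
    print(split_fixed(5_000_000))      # [1000000, 2000000, 3000000, 4000000]
    ```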

    The manuscript is unbalanced in its content, with excessive focus on the tool's application and the discoveries derived from it, rather than on the tool itself. This reduces the clarity of the key message. We recommend compressing the application section (deep learning in cancer classification) while expanding the tool description with additional explanations.

    It is also unclear what type of data the authors are using in the cancer classification section. To improve clarity, this information should be explicitly included in the methods section, detailing the sequencing strategy and copy-number tools used for each cohort.

    The methods section would benefit from a more detailed explanation of the CNSistent steps. Both Figure 1 and the text leave some parts unclear, particularly in the "Consistent Segmentation" section. Additionally, methods such as random forest and UMAP are only briefly mentioned in a supplementary figure rather than being described in the methods section. Moving these descriptions to the methods section would improve clarity.

    Figures are generally clear, but improving color differentiation would be beneficial. For example, in Figure 1, the dark red and dark orange shades are too similar, making them difficult to distinguish. A more optimized color scheme with slightly lighter tones (i.e., increased luminance) would enhance readability.

    The introduction promotes copy-number signatures; however, these signatures rely on segment lengths and unique breakpoints, which vary between samples. Since this method enforces consistent segmentation and breakpoints across all samples, its applicability to copy-number signatures is unclear. This should be discussed in the Discussion section or removed from the introduction.

    Out of curiosity: Is it possible to prioritize one type of segmentation over another? For instance, if both WGS and WES data are available, can CNSistent be configured to prioritize WGS calls? Similarly, some tools provide highly precise breakpoint calls that are valuable for detecting fusion genes or rearrangements. In such cases, it would be useful to prioritize these calls and harmonize results from other tools accordingly.

    Terminology Clarifications:

    • Blacklist, blacklisted regions, gap regions, mask: these terms should be used consistently, particularly since blacklists can be applied at different processing stages. Notably, PCAWG blacklists samples, not regions.
    • Segmentation: the term is commonly used in CNV analysis to refer to inferring continuous genomic segments from raw read counts or probe intensities. Here it has a slightly different meaning (computing consistent breakpoints across all samples), so a more explicit definition would be helpful.
    • Breakpoint merging/clustering: if these terms are synonymous, choosing one would improve readability.
    • Coverage: since "coverage" often refers to sequencing depth, a critical quality metric in DNA sequencing, it might be clearer to use "copy-number coverage" or a similar term. For example, the sentence "Next, samples with low coverage were removed using the…" could be ambiguous if read without context.

    At the end of the subsection "Explainability and the Effect of SOX2 Gene," the phrase "which exhibits significant local amplification in LUSC" should be revised to "which exhibits significant focal amplification in LUSC." The correct terminology is "focal" rather than "local," as established in Beroukhim et al. (2010).

  2. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Ellen Visscher

    The paper introduces a Python package for imputation, filtering, segmentation, feature extraction and visualisation of CNA profiles. It explains some of the elements of the package and then demonstrates how data from multiple cohorts can be processed and combined using the package's preprocessing pipeline. The authors then use processed data from three different cohorts to perform cancer type prediction using a CNN. From this, they obtain an interesting result, identifying a biomarker that differentiates two lung cancer subtypes. Throughout, they show visualisations produced with their package. The package itself seems well documented and designed to be used. Some clarification is required in the methods section, specifically around the CNN training and the models therein. There is also one major question of whether all the preprocessing steps are actually required for the downstream CNN analysis. Overall, however, this is a well-written manuscript, providing a useful software tool for further analysis of CNA data.

    Major comments:

    • CNN section: how are the segments decided? Is it based on all the training data, or just the data in a batch?
    • Throughout the results pertaining to Figure 3A-C, you call it test accuracy. To be clear, is this based on your CV hold-outs? This should be reworded everywhere to reflect this: as the name cross-validation indicates, this is not a test set but a validation set, which is also the way you use it.
    • Regarding the above, you have a comment saying: "the best test accuracy without cross-validation was 92.34%". Could you please clarify what you mean by this? Only in the CNN section do you describe your training approach, which does not mention a test or separate validation set.
    • It reads slightly unclearly: you have a section called "model transfer", but are you training three different models, one per dataset? You only have one figure for training results, which suggests one dataset, but then you have this section called model transfer.
    • Re all the above, please dedicate a small subsection in the methods making this clearer. Are there dedicated test sets? If your main results are for aggregated data, then what are you testing on to ensure generalisability? What is the point of training three different models on three different datasets? Perhaps it would make more sense to hold one dataset out as your test set (see the sketch after this list). In some ways that is what the model transfer is showing, but it would be less confusing to clarify that aim instead of suddenly introducing three models.
    • If the CNN architecture is essentially the same as in Attique et al., the performance is basically the same, and they use only CNs at gene locations: how does this demonstrate that the preprocessing from CNSistent is necessary or advantageous for this task? A result that naively combines CN calls over gene locations, compared against this approach across the aggregate datasets, might be a good way of comparing, i.e. showing that the preprocessing does offer an advantage when combining different datasets together. This is also what you argue in your abstract. For this analysis you would have to make sure you also compare across the same samples, to differentiate between filtering and the other preprocessing steps.
    • In Figure 3I, you say "notice the similarity of chromosome 3 pattern for the correctly classified LUSC samples (red) and the misclassified ones (orange)". This is confusing because the orange and red are not similar. In fact, for this whole section, Figure 3I does not seem to align with what you are saying.
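
    To illustrate the hold-one-cohort-out evaluation suggested above, here is a minimal scikit-learn sketch; the feature matrices, labels, and choice of classifier are placeholders, not the authors' pipeline:

    ```python
    # Leave-one-cohort-out evaluation: train on two cohorts, test on the
    # third. X_by_cohort / y_by_cohort are hypothetical dicts mapping
    # cohort names to feature matrices and label arrays.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def leave_one_cohort_out(X_by_cohort, y_by_cohort):
        scores = {}
        for held_out in X_by_cohort:
            train = [c for c in X_by_cohort if c != held_out]
            X_tr = np.vstack([X_by_cohort[c] for c in train])
            y_tr = np.concatenate([y_by_cohort[c] for c in train])
            model = RandomForestClassifier(n_estimators=500, random_state=0)
            model.fit(X_tr, y_tr)
            scores[held_out] = model.score(X_by_cohort[held_out],
                                           y_by_cohort[held_out])
        return scores  # e.g. {"TCGA": ..., "PCAWG": ..., "TRACERx": ...}
    ```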

    Minor comments/errors:

    • Clarification on why CNSistent needs a reference genome if it is dealing with segments: how is this information used? Is it just for the known gaps?
    • Your caption of Supplementary Figure 1 has a typo about a breakpoint at 16 instead of 14.
    • You do not explain how you use the knee point to filter (i.e. are samples above or below the knee point removed?); a sketch of one common knee-point approach follows this list.
    • Your CNN graphic is difficult to interpret and non-standard.
    • The CNN section should clarify at the beginning what the input is and what the output is (i.e. a prediction that a sample belongs to a particular cancer type) before explaining the architectural details.
    • Even though you control for class imbalance, some cancer types are so poorly represented that it is unlikely a CNN could learn them. You do mention this in the discussion, but perhaps some sort of minimum threshold for inclusion would make sense.
    • For Figure 2D you refer to it as GND, but the axes/title say hemizygosity: are these equivalent? E.g. a sample could be 3-3, with low hemizygosity but not diploid. Or, if it is aggregated across the whole genome, is it assumed equivalent?
    • There is a grammatical error "Runtimes decreased in a near-linearly with the number of compute cores"
    • You make the comment that "We therefore suspect some TCGA lung cancers might be cases of co-occurring adeno and squamous carcinomas." This is a possibility, but given the pleiotropy of many phenotypes, it may also be that the biomarker is not always unique to squamous carcinomas.
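
    On the knee-point bullet above, here is one common way such a threshold is computed, using the kneed package; the QC statistic, the toy curve, and the direction of filtering (keeping samples above the knee) are assumptions for illustration, not taken from the manuscript:

    ```python
    # Knee-point filtering sketch: rank samples by a QC statistic
    # (e.g. fraction of the genome with a copy-number call) and keep
    # those above the detected knee. Which side is kept is exactly the
    # ambiguity the review raises.
    import numpy as np
    from kneed import KneeLocator

    # Toy QC values per sample, sorted in decreasing order: a plateau
    # of good samples followed by a tail of poor ones.
    ranks = np.arange(500)
    coverage = 1 - (ranks / 500) ** 3

    knee = KneeLocator(ranks, coverage, curve="concave",
                       direction="decreasing").knee
    keep = coverage >= coverage[knee]  # assumption: above the knee passes
    print(f"knee at rank {knee}; keeping {keep.sum()} of {len(ranks)} samples")
    ```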

    Suggestions/Nice to haves:

    • Maybe make it clearer inside the paper which visualisations come with CNSistent. Looking at the software documentation, there are obviously a lot of useful visualisations included, some of which you have used, e.g., in Figure 3.
    • Given that there are more total-CN callers, it might be good to mention somewhere how CNSistent would work for total CNs only.
    • You remove profiles that you say are uninformative; could you not include them and then show how accuracy correlates with, for example, the number of breakpoints? In some ways one might think that there could be useful information in profiles with few alterations, because those alterations might be more upstream/causal.
    • The aggregation step could affect downstream analysis: taking the average could introduce CNs that were never called, and even using min/max implies a constant copy number in that region, which may lose information (e.g., if it is a functional region, having two different CNs across a gene might imply non-functionality). Did you explore the effect of the aggregation step? Perhaps taking a small enough segment resolution would account for this anyway.
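
    A tiny NumPy illustration of the concern in this last point, with hypothetical data: averaging integer copy-number calls across a bin can yield values never actually called, while min/max flattens within-bin variation:

    ```python
    import numpy as np

    # Copy-number calls for one sample across five small raw segments
    # that fall inside a single aggregated bin (hypothetical values).
    raw_cn = np.array([2, 2, 3, 3, 3])

    print(raw_cn.mean())  # 2.6 -- a copy number never called in the data
    print(raw_cn.min())   # 2   -- implies a constant CN=2 across the bin
    print(raw_cn.max())   # 3   -- implies a constant CN=3 across the bin
    # A gene spanning the 2->3 transition is misrepresented by all three.
    ```
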
  3. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf104), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 1: Stefano Monti

    This is a well-written paper that aims to develop a tool that can integrate SCNAs from large datasets, possibly generated using different platforms, to identify alteration patterns that are often undetected in smaller data subsets. The authors have used a CNN-based method for integrating the data, extracting features, and predicting cancer types from SCNA profiles. The tool has the potential to significantly simplify the integration and analysis of large-scale SCNA studies. However, some (hopefully addressable) weaknesses are noted:

    1. The choice of a classification task as the (only) way to evaluate the proposed method is questioned. I would argue that the most important use of SCNA detection is in support of mechanistic investigations, by identifying novel candidate loci likely to harbor tumor suppressors (copy losses) and oncogenes (copy gains). This type of analysis is hardly mentioned in the manuscript, and it is not clear how well the proposed tool would support it. I surmise it can, but the authors should discuss (and present results about) it.

    2. If we were to focus on the task of recurrent SCNA detection, then meta-analysis approaches (where separate analyses are performed on each of the datasets, and only the results are integrated) would need to be considered as an alternative to the approach here proposed (e.g., application of GISTIC to each of PCAWG, TCGA, TRACERx separately, followed by meta-analysis integration of the results). I am not saying meta-analysis would be superior, but the authors should discuss it, and possibly evaluate it.

    3. The reported metrics to quantify the quality of the integration are insufficient to assess the results. There is some lack of clarity about the classification accuracy results reported, since it is not clear whether all the components of the model building were adequately brought into the cross-validation (or train/test) loop. More specifically, when reporting the accuracy of the cancer type classification, it is reported that 1-megabase segmentation yields the best results. It is not clear if this size selection was performed within the train set only (and/or within the CV loop) or across the entire dataset. If the latter, this may significantly affect the accuracy results, which could not be deemed (unbiased) "test set" results. This should be clarified, and if the segment size selection was indeed performed outside the train/test split, accuracy measures should be computed again by performing the segment size selection properly (which would of course mean that a potentially different size might be selected for each fold).
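
    For clarity, the evaluation being asked for would look roughly like this nested cross-validation sketch (scikit-learn; `build_features(bin_size)` is a hypothetical helper that re-segments the data at a given resolution, and the random forest is a stand-in model):

    ```python
    # Nested CV sketch: the segment (bin) size is selected on inner folds
    # only, so the outer test folds stay unbiased by the size choice.
    import numpy as np
    from sklearn.model_selection import StratifiedKFold, cross_val_score
    from sklearn.ensemble import RandomForestClassifier

    def nested_cv_accuracy(y, bin_sizes, build_features,
                           make_model=lambda: RandomForestClassifier(random_state=0),
                           outer_splits=5):
        outer = StratifiedKFold(outer_splits, shuffle=True, random_state=0)
        scores = []
        for train_idx, test_idx in outer.split(np.zeros(len(y)), y):
            # Inner loop: choose the bin size that scores best on the
            # training fold alone; it may differ between outer folds.
            best_size = max(bin_sizes, key=lambda s: cross_val_score(
                make_model(), build_features(s)[train_idx],
                y[train_idx], cv=3).mean())
            X = build_features(best_size)
            model = make_model()
            model.fit(X[train_idx], y[train_idx])
            scores.append(model.score(X[test_idx], y[test_idx]))
        return float(np.mean(scores))
    ```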

    4. Comparisons with other methods: The authors only compare their method to random forest (RF). Related to the previous point: I presume the RF model used the segment size that was optimized for the CNN model (i.e., 1 Mb). If this is the case, it would be an unfair comparison, since RF might favor a different size. Also, additional classifiers should be evaluated (e.g., Elastic Net or SVM).
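
    A fair version of this comparison would let each classifier select its own segment size, for example by reusing the nested-CV sketch above with different model factories (scikit-learn; a hypothetical comparison, not the authors' setup):

    ```python
    from sklearn.svm import SVC
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import RandomForestClassifier

    # Each classifier gets its own inner-loop size selection rather than
    # inheriting the 1 Mb size tuned for the CNN.
    candidates = {
        "random forest": lambda: RandomForestClassifier(random_state=0),
        "SVM (RBF)": lambda: SVC(),
        "elastic-net logit": lambda: LogisticRegression(
            penalty="elasticnet", solver="saga", l1_ratio=0.5, max_iter=5000),
    }
    # for name, make in candidates.items():
    #     acc = nested_cv_accuracy(y, bin_sizes, build_features, make)
    ```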

    5. The discussion of existing tools/methods is insufficient. This should be corrected (see also my comment about meta-analysis approaches).

    6. Metadata effects: Age influences copy number alterations. The authors do not consider age or any other metadata, or their implications for the classification task.

    7. Runtime statistics and user requirements: While the authors report runtime curves per command (S Fig 6), it is difficult to translate these into a total runtime. It would be useful if the runtime for the entire training of a model were reported. Additionally, if available, a comparison of runtime statistics with the established model that they cite would be useful.

    8. IG-based explanation: I found this section somewhat perfunctory, not sufficiently justified, and adding little to the manuscript. IG is computationally expensive, and it does not provide any way to statistically quantify the found associations. Simpler methods, such as testing for association between SCNA occurrence and cancer type, should be evaluated and compared against.
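
    The simpler analysis suggested here could be as little as a per-segment association test, for instance (SciPy/statsmodels; the binarization threshold for calling a gain is an assumption):

    ```python
    # Per-segment association between SCNA occurrence and cancer type:
    # a 2x2 Fisher's exact test per segment with FDR correction.
    import numpy as np
    from scipy.stats import fisher_exact
    from statsmodels.stats.multitest import multipletests

    def scna_type_association(cn, labels, gain_threshold=3):
        """cn: (samples x segments) copy-number matrix;
        labels: binary array, 1 for the cancer type of interest."""
        pvals = []
        for seg in cn.T:
            altered = seg >= gain_threshold  # assumption: gain = CN >= 3
            table = [
                [np.sum(altered & (labels == 1)), np.sum(altered & (labels == 0))],
                [np.sum(~altered & (labels == 1)), np.sum(~altered & (labels == 0))],
            ]
            pvals.append(fisher_exact(table)[1])
        reject, qvals, _, _ = multipletests(pvals, method="fdr_bh")
        return qvals, reject  # FDR-adjusted p-values per segment
    ```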

    9. Model selection: There is no adequate justification of why a CNN was picked for this task when the referenced paper itself claims the DNN architecture performs better. Is this perhaps because of the varying segment size? Again, this is not clearly stated. https://pmc.ncbi.nlm.nih.gov/articles/PMC9203194/#tab1