Paralog dispensability shapes homozygous deletion patterns in tumor genomes

Abstract

Genomic instability is a hallmark of cancer, resulting in tumor genomes having large numbers of genetic aberrations, including homozygous deletions of protein coding genes. That tumor cells remain viable in the presence of such gene loss suggests high robustness to genetic perturbation. In model organisms and cancer cell lines, paralogs have been shown to contribute substantially to genetic robustness—they are generally more dispensable for growth than singletons. Here, by analyzing copy number profiles of > 10,000 tumors, we test the hypothesis that the increased dispensability of paralogs shapes tumor genome evolution. We find that genes with paralogs are more likely to be homozygously deleted and that this cannot be explained by other factors known to influence copy number variation. Furthermore, features that influence paralog dispensability in cancer cell lines correlate with paralog deletion frequency in tumors. Finally, paralogs that are broadly essential in cancer cell lines are less frequently deleted in tumors than non‐essential paralogs. Overall, our results suggest that homozygous deletions of paralogs are more frequently observed in tumor genomes because paralogs are more dispensable.

  1. Note: This rebuttal was posted by the corresponding author to Review Commons. Content has not been altered except for formatting.


    Reply to the reviewers

    Reviewer #1 (Evidence, reproducibility and clarity):

    Summary:
    In this study, the authors delineate the association of paralog dispensability with the frequency of homozygous deletions (HDs) and thereby show that paralog dispensability can play a significant role in shaping tumor genomes. The authors analyzed the strength of negative selection on the paralogs relative to the singletons using HD frequencies. The study focused on HDs because they ensure a complete loss of function, unlike other mutational aberrations that can be masked because of haplo-sufficiency. While accounting for potential confounding factors, the authors find that paralogs tend to have a relatively high frequency of HDs, suggesting a relaxed negative selection. Furthermore, the authors specifically attribute this association to the dispensable paralogs by analyzing gene inactivation data generated from multiple experimental systems. Overall, the findings of this study can potentially have significant implications in the cancer biology field and specifically for researchers studying cancer genome evolution.

    We thank the reviewer for the careful reading and positive assessment of our manuscript.

    Major comments:

    1. To dissect further which dispensable paralogs are more likely to be associated with a high HD frequency, synthetic lethal paralogs could be compared with non-synthetic lethal ones.

    In the section titled 'Homozygous deletion frequency of paralog passengers is influenced by paralog properties' (begins from line #289), the authors have shown that paralogs with a high frequency of HDs are more likely to have the properties of dispensability (in Figure 4). It seems that all of those properties are also associated with synthetic lethality as the authors identified in their previous study (De Kegel et al. 2021). Furthermore, as shown in the subsequent section ('Essential paralogs are less frequently homozygously deleted than non-essential paralogs', begins from line #344), the high HD frequency is associated with the dispensable paralogs. Some of those dispensable paralogs are expected to be synthetic lethal. Therefore, the association of paralogs with a high frequency of HDs with experimentally validated or predicted sets of synthetic lethal paralogs could be tested. This may help the authors to contextualize their findings in terms of genetic interactions between paralogs.

    We thank the reviewer for highlighting the potential relationship with our previous work. We agree that many of these properties are associated with synthetic lethality, but we note that they are also associated with single gene essentiality. This makes the relationship between synthetic lethality, essentiality, and deletion frequency somewhat difficult to dissect.

    Nonetheless we have tested, in a number of ways, whether there is a relationship between a paralog having a reported/predicted synthetic lethality and being homozygously deleted. We find no obvious connection between the two.

    We first tested using a set of synthetic lethal interactions identified by integrating molecular profiling data with genome wide CRISPR screens in a large panel of cancer cell lines (the data used to train the classifier in De Kegel et al, 2021). As there is an ascertainment bias in this dataset (paralogs must have frequent loss of function alterations / silencing to be tested) we restricted our analysis to only those paralog pairs tested for synthetic lethality. We identified no clear pattern (p>0.05, Fisher's Exact Test).
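A test of this kind can be sketched in a few lines. The following is a minimal illustration, not the code used for the analysis: it computes a two-sided Fisher's exact p-value for a 2x2 table (for example, synthetic lethal or not versus ever deleted or never deleted) from the hypergeometric distribution. The function name and the counts are hypothetical.

```python
from math import comb

def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are no more probable than the observed one
    # (the small tolerance guards against floating point ties)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-7))
    return min(1.0, p)

# Hypothetical counts: rows = synthetic lethal / not synthetic lethal,
# columns = ever deleted / never deleted
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(round(p_value, 4))
```

Restricting the table to pairs that were actually tested for synthetic lethality, as described above, only changes which pairs are counted, not the test itself.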

    We next tested using an integrated set of four combinatorial CRISPR screens (aggregated in De Kegel et al) where we considered a pair to be synthetic lethal if it was a hit in any screen and not synthetic lethal if it was screened at least once and never identified as a hit. Again we restricted our analysis to paralogs that were present in this dataset to prevent issues with ascertainment bias. We found no clear association.

    We further tested using a consensus dataset derived from the same combinatorial screens, where a pair were marked as synthetic lethal if they were identified as a hit in at least two screens and not synthetic lethal if they were screened at least twice and never identified as a hit. Again we restricted our analysis to paralogs that were present in this dataset and found no clear association.
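The two labelling rules described above (a hit in any screen, and a hit in at least two screens) can be written as one small function. This is a sketch under stated assumptions: each pair is represented by one boolean per screen in which it was tested, and the function names and gene names are hypothetical placeholders.

```python
from typing import Dict, List, Optional, Tuple

Pair = Tuple[str, str]

def label_pair(hits: List[bool], min_hits: int, min_screens: int) -> Optional[bool]:
    """Return True (synthetic lethal), False (not synthetic lethal) or None (excluded).

    hits holds one boolean per screen in which the pair was tested.
    """
    if sum(hits) >= min_hits:
        return True
    if len(hits) >= min_screens and not any(hits):
        return False
    return None  # not enough evidence either way, so the pair is left out

def label_pairs(screens: Dict[Pair, List[bool]], consensus: bool) -> Dict[Pair, bool]:
    # "Any screen" rule: one hit / screened once; consensus rule: two hits / screened twice
    k = 2 if consensus else 1
    labels = {pair: label_pair(hits, k, k) for pair, hits in screens.items()}
    return {pair: lab for pair, lab in labels.items() if lab is not None}

# Hypothetical screen results (gene names are placeholders)
screens = {
    ("A1", "A2"): [True, True, False],
    ("B1", "B2"): [True, False],
    ("C1", "C2"): [False, False],
    ("D1", "D2"): [False],
}
union = label_pairs(screens, consensus=False)
consensus = label_pairs(screens, consensus=True)
```

In this toy input the pair with a single hit is labelled synthetic lethal under the first rule but excluded under the consensus rule, which is the main practical difference between the two datasets.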

    We finally tested using our predicted synthetic lethal interactions – annotating the top 3% of predictions as synthetic lethal and the remainder as non-synthetic lethal. The 3% threshold is similar to the observed frequency of synthetic lethality in the training set. In this case, as this dataset covers all paralogs considered, no restriction was necessary.
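Annotating the top 3% of predictions amounts to ranking pairs by score and cutting at a fixed fraction. The sketch below illustrates this; the scores are hypothetical, and rounding the cut-off to the nearest whole pair and breaking ties by input order are assumptions.

```python
def label_top_fraction(scores, fraction=0.03):
    """Label the highest-scoring `fraction` of pairs as predicted synthetic lethal."""
    n_top = round(len(scores) * fraction)  # rounding to the nearest pair is an assumption
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:n_top])
    return {pair: pair in top for pair in scores}

# Hypothetical prediction scores for 100 paralog pairs
scores = {f"pair_{i}": i / 100 for i in range(100)}
labels = label_top_fraction(scores)
predicted_sl = sorted(pair for pair, is_sl in labels.items() if is_sl)
```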

    None of the above analyses revealed a clear relationship between deletion frequency and synthetic lethality. A caveat of these analyses is that none of the experimental datasets are complete (covering only a minority of all paralog pairs) and they are all somewhat noisy. Furthermore, as we show in our modelling analysis (Fig S3) the observed homozygous deletions are far from saturating.
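One common way to gauge saturation is to down-sample the cohort and count how many distinct genes are deleted at each cohort size. The sketch below illustrates that idea only; it is not the modelling analysis from the manuscript, and the function names, sample names and gene names are hypothetical.

```python
import random

def genes_deleted_in_subsample(hd_genes_by_sample, n, rng):
    """Number of distinct genes deleted in a random subset of n samples."""
    chosen = rng.sample(sorted(hd_genes_by_sample), n)
    return len(set().union(*(hd_genes_by_sample[s] for s in chosen)))

def saturation_curve(hd_genes_by_sample, sizes, repeats=100, seed=0):
    """Mean number of ever-deleted genes at each subsample size."""
    rng = random.Random(seed)  # fixed seed so the curve is reproducible
    return {
        n: sum(genes_deleted_in_subsample(hd_genes_by_sample, n, rng)
               for _ in range(repeats)) / repeats
        for n in sizes
    }

# Hypothetical input: sample -> set of homozygously deleted genes
hd = {"s1": {"g1", "g2"}, "s2": {"g2"}, "s3": {"g3"}, "s4": set()}
curve = saturation_curve(hd, sizes=[0, 2, 4], repeats=50)
```

If the curve is still rising steeply at the full cohort size, additional samples would be expected to reveal deletions of genes not yet observed as deleted.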

    However we think there may be a simpler explanation, beyond limitations of the data, for why we do not observe a relationship between HDs and synthetic lethality.

    As the reviewer notes, there is evidence in cell lines that one reason paralogs are more dispensable than singletons is because of buffering / redundant relationships as revealed by synthetic lethal interactions. These relationships therefore provide an explanation for why some paralogs are dispensable. As our primary claim is that paralogs are more frequently deleted because they are more dispensable we might anticipate a relationship between deletion frequency and synthetic lethality. However, by definition, synthetic lethal interactions can only be observed for non-essential (dispensable) genes. Therefore when analysing the overlap with synthetic lethal interactions we are primarily restricting our analyses to genes that are already individually dispensable. Consequently we might not anticipate observing any enrichment. The buffering relationship revealed by synthetic lethality provides an explanation for why a paralog is dispensable but once we are restricting our analysis to dispensable paralogs we do not necessarily expect to see further enrichment.

    We think that an ideal way to explore this question further would be to look at selection on deletions of pairs of paralogs – we anticipate that if a gene is dispensable because of paralog buffering then both paralogs should not be deleted simultaneously. However, the current copy number datasets are too small to evaluate such pairwise relationships. This is discussed in the manuscript as follows:

    Analyzing the frequency with which two members of a paralog family are lost would provide more direct insight into the contribution of paralog redundancy, but due to the overall rarity of passenger gene HDs, we cannot make a comprehensive assessment of co-deletions here – e.g. among paralog pairs where both genes are non-drivers, and not on the same chromosome, only two pairs are co-deleted in at least one TCGA sample. Larger cohorts would also allow us to search for patterns of mutual exclusivity of HDs to identify genetic interactions – this has been applied for identifying interactions between driver genes [57,58], but is more challenging for interactions between non-driver genes, which are much less frequently altered.

    Minor comments:

    1. The number of TCGA and ICGC tumor samples analyzed:
      As mentioned in the Results section (line #106), 9966 tumor samples were analyzed. However, the sample size mentioned in Figure 2A is 9951. Is the lower number shown in the figure due to the filtering procedure mentioned in the Methods section (line #455)? The change in sample sizes could be explained. A similar difference in sample sizes exists for the ICGC data also.

    The difference was indeed due to the filtering process, but the numbers were only provided in the Methods. We have now addressed this in the main text:

    After removing a small number of ‘hyper-deleted’ samples (see Methods) we retained 9,951 samples for further analysis.

    2. The rationale behind setting the threshold at 100 HD genes to classify 'hyper-deleted' samples for TCGA (line #462) and ICGC data (line #473) could be explained.

    We excluded hyper-deleted samples to avoid any individual sample having undue influence on the genes observed to be ever deleted or indeed to influence the overall patterns observed. It is also common in analyses of selection in tumours that make use of mutational profiles (rather than copy number profiles) to exclude hypermutated samples (e.g. Martincorena et al, Cell 2017; Lopez et al, Nature 2020). However the exact threshold of 100 genes was somewhat arbitrary and this query prompted us to assess whether it had any significant impact on the results.

    We therefore repeated all analyses using a more stringent threshold (50 genes) and also without thresholding. Although the exact percentages and odds-ratios vary somewhat with the different thresholds, all major conclusions are still supported.
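The sensitivity check described above can be sketched as a filter on the number of homozygously deleted genes per sample, applied with several cut-offs. This is an illustration only: the input layout and function names are hypothetical, and treating the threshold as inclusive (samples with exactly the threshold number of genes are kept) is an assumption.

```python
def filter_hyper_deleted(hd_genes_by_sample, max_genes=None):
    """Drop samples with more than max_genes homozygously deleted genes.

    max_genes=None keeps every sample (no thresholding).
    """
    if max_genes is None:
        return dict(hd_genes_by_sample)
    return {s: g for s, g in hd_genes_by_sample.items() if len(g) <= max_genes}

def ever_deleted(hd_genes_by_sample):
    # Union of genes deleted in at least one retained sample
    return set().union(*hd_genes_by_sample.values())

# Hypothetical samples with 3, 60 and 150 deleted genes
samples = {
    "s1": {f"g{i}" for i in range(3)},
    "s2": {f"g{i}" for i in range(60)},
    "s3": {f"g{i}" for i in range(150)},
}
retained = {t: len(filter_hyper_deleted(samples, t)) for t in (100, 50, None)}
```

Downstream statistics, such as the set returned by `ever_deleted`, can then be recomputed for each threshold and compared.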

    We appreciate that this was a minor comment and that the reviewer did not request this new analysis, but in the absence of a strong justification for a single threshold we felt it appropriate to assess multiple thresholds (and none).

    3. Citation for DepMap is missing (caption of Figure 5).

    We have added the text below to the legend for Figure 5:

    Essential genes for the DepMap dataset (Meyers et al, 2017) are obtained from a version of the data reprocessed in (De Kegel et al, 2021) to reduce off-target sgRNA effects (see Methods).

    CROSS-CONSULTATION COMMENTS
    Along the lines of Reviewer #3's second major comment, I have a suggestion, the possible benefits of which would depend on the target audience to which the authors intend to communicate their study.

    I would suggest including a brief comparison of the findings of this study which deal with human paralogs, with the findings in model organisms such as yeast, perhaps in the discussion section. To facilitate such a comparison, authors could try measuring the enrichments of, for example, molecular functions, gene families, types of genetic interactions, etc., among the paralogs associated with a high frequency of HDs and then discussing their comparison with what is known in the literature for paralogs in other model organisms that tend to be frequently deleted.

    Such a comparison could be of interest to the community of researchers working on other model organisms and put this study in a much broader context. However, as I said before, this would depend on the authors' intended target audience.

    We thank the reviewer for the suggestion. We have added an additional section to the discussion highlighting differences and similarities to the observations from yeast as follows:

    Much of our understanding of the factors that influence gene dispensability comes from studies in model organisms, in particular the budding yeast Saccharomyces cerevisiae [3,9,10,43,44]. Analyses of the yeast gene deletion collection, a set of gene deletion mutants systematically generated in a single S. cerevisiae strain, revealed that paralogs were less likely to be essential than singleton genes [3,45]. Furthermore, more detailed analyses of yeast paralogs revealed that paralogs from large families were less likely to be essential as were genes with highly sequence similar paralogs [43,44]. Previous analyses, including our own, demonstrated that many of these trends are also evident when analyzing gene essentiality from CRISPR screens in cancer cell lines [12,13,15,35]. Our results here are also consistent with these findings – many of the features that are associated with paralog dispensability in yeast are also associated with gene deletion frequency in tumor genomes.

    The connection between the budding yeast observations and those in cancer is less clear when it comes to the relative dispensability of WGDs and SSDs. Analyses of the yeast gene deletion collection revealed that SSDs are more likely to be essential than WGDs in the single genetic background studied [43,44]. In our previous analyses of gene essentiality in hundreds of cancer cell lines we found that SSDs were more likely to be broadly essential (essential in most cell lines) than WGDs but that WGDs were less likely to be never essential (i.e. more likely to be essential in at least one cell line)[13]. As the analyses of gene essentiality in budding yeast were generated in a single genetic background the concordance with our cancer cell line results was difficult to assess, but as gene deletion collections are now being generated in additional yeast strains it should become possible to perform a more direct comparison[46–48].

    Here we found that WGDs are less likely to be deleted than SSDs in tumors. This is surprising in light of the yeast gene deletion collection results, where SSDs were more likely to be essential than WGDs in the strain studied, but less so in light of the cancer cell line results, where WGDs were less likely to be never essential. It is also worth noting that experimental evolution studies in yeast found that SSDs accumulate protein-altering mutations at a higher rate than WGDs [49,50]. These results are perhaps especially relevant when analyzing the influence of paralog features on selection in tumors.

    We note that there are many additional differences in the features of WGDs and SSDs in budding yeast that may alter their relative dispensability in tumors. An obvious large scale difference is that in the ancestor of humans there were two rounds of whole genome duplication compared to a single duplication event in yeast[51,52]. Less obvious, but potentially of importance for cancer, is that the two classes of paralogs are enriched in pathways in humans that do not have obvious counterparts in yeast. For example, WGDs are highly enriched in signaling pathways involved in development while SSDs are enriched in immune response genes[53]. How the membership of these pathways influences the dispensability and selection of genes in tumors and cancer cell lines warrants further study.

    Reviewer #1 (Significance):

    As the authors note in their manuscript, it is expected that paralog dispensability could be associated with the relaxed negative selection in tumor genomes because (1) paralogs are prevalent in the human genome, and (2) many of them are dispensable, as apparent from the large-scale gene inactivation screens in hundreds of cancer cell lines (Blomen et al. 2015, Wang et al. 2015, Dandage and Landry 2019, De Kegel and Ryan 2019). However, direct mapping of this association, while importantly accounting for potential confounding factors, has been lacking.
    As a researcher with prior experience in research topics such as gene duplication and genetic interactions, it appears to me that this study presents formal proof of the important association between paralog dispensability and tumor genome evolution, which could have major implications for the cancer biology research community and specifically for researchers dealing with topics such as cancer evolution, copy number alterations in cancer genomes, and synthetic lethality-based precision oncology therapeutics.

    Thank you again for the positive assessment.

    Reviewer #2 (Evidence, reproducibility and clarity):

    Summary

    Here, De Kegel & Ryan analyse thousands of tumour samples from the TCGA and ICGC projects to identify homozygously deleted genes, finding that about 40% of protein-coding genes are deleted in at least one sample. They find homozygously deleted genes to be enriched for paralogous genes, and further, more frequently deleted genes are increasingly likely to be paralogs. The authors then test the influence of several factors on the likelihood of being deleted, including gene length, distance to a fragile site or chromosomal region, and distance to a recurrently deleted tumour suppressor gene (TSG). They find that proximity of a TSG, telomere, centromere, and fragile site all increase likelihood of being deleted in a sample, as does gene length. Having a paralog also remains an important predictor of deletion after accounting for these other factors. Additionally, the more similar in sequence the closest paralog is to the gene and having a larger gene family size are also predictive of deletion. Conversely, if a gene is a whole genome duplicate as opposed to a small-scale duplicate, it is less likely to be deleted. Finally, the authors test the hypothesis that paralogs that are deleted in cancer are less likely to be essential and find that this is indeed the case.

    Comments

    The authors have done a good job of identifying trends of paralog deletion in cancer samples and the factors influencing them. The results are well described and presented and support the conclusions. I like the inclusion of the saturation analysis as an estimate of what to expect given current and potential future sample sizes, and I appreciate the inclusion of a WGD/SSD paralog distinction. The data and methods are sufficiently detailed. I have a few minor comments below.

    We thank the reviewer for the careful reading and positive assessment of our manuscript.

    1. Around line 160 in the text and supplemental figure 4A, the authors test if the trends they see are observed across individual cancer types. With 9 of 33 cancer types reaching a sample size threshold, 8 of 9 comparisons are significant. The authors do not state correcting for multiple testing.

    We have now also assessed the significance of the results after performing a Holm-Bonferroni correction for multiple hypothesis testing and find that all eight of the previously significant cancer types remain significant.
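For reference, the Holm-Bonferroni step-down adjustment can be sketched as follows: the k-th smallest of m p-values is multiplied by (m - k + 1), and the adjusted values are forced to be non-decreasing. The function name and the p-values are hypothetical and are not the values from the analysis.

```python
def holm_bonferroni(p_values):
    """Return Holm-adjusted p-values in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # rank is 0-based, so (m - rank) equals (m - k + 1) for the k-th smallest value;
        # the running maximum enforces monotonicity of the adjusted values
        running_max = max(running_max, (m - rank) * p_values[i])
        adjusted[i] = min(1.0, running_max)
    return adjusted

# Hypothetical p-values for four comparisons
p_values = [0.01, 0.04, 0.03, 0.005]
adjusted = holm_bonferroni(p_values)
n_significant = sum(p < 0.05 for p in adjusted)
```

A comparison is then called significant if its adjusted p-value is below the chosen level, here 0.05.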

    2. I initially misunderstood the hemizygous deletion analysis, thinking the analysis in supplemental figure 4B/C was asking if a sample has any singleton or any paralog deleted and comparing the number of samples with any deletion of either - given the number of genes deleted per sample this wouldn't make sense as a good test. I think the authors are actually comparing the number of loss-of-heterozygosity events per gene and grouping by paralog/singleton. I think this is a good analysis, but I think it would be helpful to clarify this in the text and figure legend e.g. "Samples w/ gene LOH" could be "LOH events per gene" or something similar.

    As suggested we have now updated the y-axis label in these charts to ‘LOH events per gene’. We note that there are now two additional panels in this figure to address copy neutral LOH, per Reviewer 3’s request.

    3. Occasionally, I wanted some more detail in the text for context, which was sometimes later provided - e.g. I noted when reading about line 125 that I was curious at this point how often TSGs occurred on segments, and this was later provided on line 241. Similarly, around line 114 I was curious how many genes are typically deleted per HD segment, for which the median value was provided on line 206 (and distribution in supplemental figure 1), and again for hemizygous deletions. I think sometimes it would be helpful to provide this context earlier in the text to aid interpretation of the results.

    We thank the reviewer for these suggestions which we have now incorporated into the text.

    On line 115 (previously 114) the relevant sentence now reads:

    Typically an HD that results in the loss of a protein coding gene also results in the loss of several chromosomally adjacent genes – in the TCGA dataset a median of three genes are lost per gene-deleting HD segment.

    On line 124 the relevant sentence now reads:

    We found that almost half (49%) of the HDs that result in the loss of at least one protein coding gene overlap a known tumor suppressor.

    4. In the discussion, on line 420, the authors include the point that a paralog might not be required at all in a tumour cell and therefore easily deleted. I think this possibility could be expanded on here and in the introduction/results section, as it is an important point. I think it would be helpful to include more about the possibility that a paralog might be deleted in a tumour cell because it is simply not required or that it is more likely to have less of a phenotypic impact compared to a singleton, and that this could be a reason for the observed enrichment of paralogs in deleted genes. A citation to support this point could be Áine N O'Toole, Laurence D Hurst, Aoife McLysaght, Faster Evolving Primate Genes Are More Likely to Duplicate, Molecular Biology and Evolution, Volume 35, Issue 1, January 2018, Pages 107-118, https://doi.org/10.1093/molbev/msx270. Duplicate genes can be duplicates because copy number variation of them has minimal impact.

    We thank the reviewer for raising this important point.

    We have briefly addressed this in the introduction as follows:

    In multiple model organisms, paralogs have been demonstrated to be more dispensable than singletons (genes without a paralog) [3–5]. There are a number of reasons for why a paralog might be more dispensable than a singleton gene, including preferential retention of duplications of non-essential genes [6,7], but perhaps the most obvious explanation is buffering between paralogs.

    Where references 6 and 7 are as follows:

    1. O’Toole ÁN, Hurst LD, McLysaght A. Faster Evolving Primate Genes Are More Likely to Duplicate. Mol Biol Evol. 2018;35: 107–118.
    2. He X, Zhang J. Higher duplicability of less important genes in yeast genomes. Mol Biol Evol. 2006;23: 144–151.

    We discuss this more comprehensively in the discussion as follows:

    In both yeast and cancer there are a number of reasons for why paralogs might be more dispensable than singleton genes. Perhaps the most obvious is the existence of buffering relationships between paralog pairs, such that when one paralog is lost the other paralog can compensate for this loss. Such buffering relationships between paralogs can be revealed through synthetic lethality screens and a number of recurrently deleted paralogs in cancer have already been reported to display synthetic lethal interactions with their paralog (recently reviewed in [54]). Supporting this model, in previous work analysing essentiality in cancer cell lines we found that buffering relationships between paralogs could explain 13-17% of cases where a paralog was essential in some cell lines but not others[13]. This suggests that at least some of the increased dispensability of paralogs in cancer cells can be attributed to buffering relationships between paralog pairs. However this is not the only explanation for paralogs displaying increased dispensability in tumour cells. An additional explanation is that paralogs may perform essential functions in specific contexts (e.g. within specific tissues or at specific developmental stages) but are not required within the specific context of a tumour. Consistent with this model, human paralogs are more likely to display tissue-specific expression patterns [55]. Finally we note that there is evidence to suggest that genes whose perturbation has a lower phenotypic impact may be more ‘duplicable’ – i.e. rather than paralogs being under weaker selection because they are duplicated, their duplication was tolerated because they were already under weaker selection[6,7]. Teasing apart the relative contributions of these factors to the increased dispensability of paralogs in cancer will require further research and potentially new data resources such as gene essentiality profiles in diverse non-cancer cell types [56].

    CROSS-CONSULTATION COMMENTS
    I agree, that's a helpful suggestion from reviewer 1.

    Reviewer 3's suggestion regarding the age of the two whole genome duplication events is quite difficult to unpick as the duplication events seem to have happened relatively close in time to each other while rediploidisation of the first was occurring. Additionally, paralogs from SSDs tend to be more similar in sequence simply because the two WGD events are relatively old while SSDs can occur at any time up to present. They're therefore biased by young duplicates that have not had the opportunity to diverge much and decrease in sequence similarity.

    We appreciate these comments.

    Reviewer #2 (Significance):

    This is a novel study as it examines the frequency of paralog deletion in cancer samples and the factors influencing it, building upon work already conducted in cancer cell lines. This study extends the knowledge of the field, confirming in actual cancer samples trends previously observed in cell lines. It confirms that paralogs are more dispensable than singletons, likely because they have a similar counterpart that can provide some level of functional redundancy. The observation that a gene is more likely to be deleted the more similar its closest paralog is provides support for this.
    It is certainly limited by the number of samples currently available in the two cancer sample projects included but the authors attempt to quantify how limiting this sample size is by conducting a saturation analysis using down-sampling to estimate how many gene deletions one can expect from different numbers of samples. This is important as the lack of observance of many gene deletions is likely due to the limited sample size and not due to negative selection. This low observance of gene deletions disappointingly limits further testing beyond single paralogs to consider the deletion effects of multiple gene family members and more directly test evidence of functional redundancy between paralogs. The authors provide a good discussion of the limitations of their study.

    The results are of interest to evolutionary biologists and cancer biologists. Those with an interest in duplicate genes, and/or factors affecting gene loss in tumours will be interested in this work.

    My field of expertise is molecular evolution, gene duplication and copy number variation.

    We thank the reviewer for the positive assessment of the significance of our work.

    Reviewer #3 (Evidence, reproducibility and clarity):

    Thank you for the opportunity to review "Paralog dispensability shapes homozygous deletion patterns in tumor genomes" by De Kegel et al. This manuscript uses TCGA and ICGC tumor data to show evidence for paralog dispensability. They analyze the rate of homozygous deletions and show that it is higher for paralogs compared to singletons. Their findings are robust to a number of confounding variables that they take into account e.g. distance to tumor suppressor, telomere, centromere or fragile site. They show that paralogs that belong to large families and have higher sequence identity tend to show more dispensability and these dispensable paralogs are less likely to be WGD.

    We thank the reviewer for the time taken to review our manuscript.

    Major comments.

    1. Does the finding pertaining to the lack of enrichment of paralogs in LOH regions take into account whether the LOH is copy neutral or not, i.e. how does dosage affect this finding? Is it possible that there is a difference in paralog rate in LOH that results in a total copy number of 1, and that the presence of copy neutral LOH masks the effect? Also, integration of a gene expression dataset would help to identify paralogs that compensate for the loss of their sister by upregulating their own expression.

    In the submitted manuscript we focussed solely on LOH events where the copy number of one allele was 0 and the other allele was ≥1. These include copy loss events (total copy number = 1), copy neutral events (total copy = 2), as well as amplifications (total copy number > 2). The rationale for this approach was that we were interested in understanding whether the mechanism that was generating deletions was preferentially generating deletions in paralog-rich regions.

    However, we agree that understanding the influence of dosage is of interest here. We have therefore expanded the analysis in the paper to separately assess the enrichment of paralogs in copy neutral LOH regions (total copy number = 2) and copy loss LOH regions (total copy number = 1).
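The LOH categories used here follow directly from allele-specific copy numbers, as the sketch below illustrates. It assumes each segment comes with an integer copy number for each allele; the function name and category labels are hypothetical.

```python
def classify_segment(major_cn: int, minor_cn: int) -> str:
    """Classify a segment from its allele-specific copy numbers."""
    if minor_cn > major_cn:
        major_cn, minor_cn = minor_cn, major_cn
    total = major_cn + minor_cn
    if total == 0:
        return "homozygous deletion"  # both alleles lost
    if minor_cn > 0:
        return "no LOH"               # both alleles still present
    if total == 1:
        return "copy loss LOH"        # one allele lost, one copy remains
    if total == 2:
        return "copy neutral LOH"     # one allele lost, the other duplicated
    return "amplified LOH"            # one allele lost, the other amplified

# Hypothetical (major, minor) copy numbers
examples = [(0, 0), (1, 1), (1, 0), (2, 0), (4, 0)]
labels = [classify_segment(major, minor) for major, minor in examples]
```

The original analysis pooled the last three categories, whereas the expanded analysis treats copy loss LOH and copy neutral LOH separately.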

    As shown in the updated Figure S4B we do not find an enrichment of paralogs in genes subject to either copy neutral LOH or copy loss LOH.

    The relevant section of the text on page 6 now reads:

    We do not find that paralogs are more frequently subject to LOH than singletons in either the TCGA or ICGC cohort (Fig. S4B-C); when considering all LOH segments we even see that singletons are slightly more frequently subject to LOH in the ICGC cohort (Fig. S4C, left), but when considering only focal LOH segments – i.e. segments whose length is less than half of the chromosome arm’s length, which is the case for all HD segments – there is no significant difference between paralog and singleton LOH frequency in either cohort. To assess whether gene dosage influenced the observed LOH frequency we further restricted our analysis to copy neutral LOH events (total copy number = 2) and copy loss LOH events (total copy number = 1) and again found no significant increase in deletion frequency of paralogs compared to singletons (Fig. S4B-C).

    Regarding the integration of gene expression to identify dosage compensation between paralogs – we agree that this is extremely interesting. However, it is quite challenging to address properly. Most paralogs are only observed to be homozygously deleted a single time and so statistically identifying how loss of one gene impacts the mRNA abundance of another is challenging. In the minority of cases where a paralog is recurrently deleted, often these deletions occur in samples from different cancer types and so integrating transcriptomic data still presents some technical challenges. Given this complexity, and as the question of dosage compensation is not central to our key observations, we have not integrated transcriptomic data here.

    2. Is the finding that paralogs are depleted among WGD influenced by the age of the WGD, since there are 2 WGD events? Do SSDs tend to be more or less similar by sequence than WGDs? This should be explored further since this observation is the opposite of what is observed in model organisms such as yeast whereby SSD are less functionally similar than WGD and often show properties similar to singletons than WGD.

    As noted by reviewer 2 in the cross commentary, it is extremely challenging to age the duplicates that arose from the WGD due to the close temporal proximity of the two whole genome duplication events. In the dataset of paralogs analysed here, SSDs have lower average sequence identity than WGDs. However, we note that both sequence identity and duplication type are included in our regression analysis (Figure 4D) and both are significantly associated with homozygous deletion frequency.

    This should be explored further since this observation is the opposite of what is observed in model organisms such as yeast, whereby SSD are less functionally similar than WGD and often show properties more similar to singletons than to WGD.

    We do not actually think that our results are in opposition to the findings from model organisms. The bulk of studies on the functional consequences of deletions of SSDs/WGDs in model organisms are derived from analyses of the budding yeast gene deletion collection, which is generated in a single strain and grown in lab conditions. Consequently, these studies report on which genes can be lost in a single genetic background when grown in rich media. We think it is not fully clear how these findings will apply in the context of a panel of genetically heterogeneous tumours derived from multiple different cell types. We note that there are additional complexities when analysing human genes (tissue types, two rounds of WGD, metazoan specific pathways enriched in either WGDs/SSDs) that make a straightforward comparison with yeast challenging. We also note that although the results of analyses of the yeast gene deletion collection suggest that SSDs are more likely to be essential than WGDs, experimental evolution studies have demonstrated that SSDs are more likely to accumulate protein altering mutations than WGDs (Keane et al, Genome Research 2014; Fares et al, PLoS Genetics 2013). This is not what one would expect based on the analyses of the yeast gene deletion collection, but is closer to what we observe for tumour genomes where SSDs are more likely to be homozygously deleted.

    We agree that we did not adequately discuss these issues in the previous version of our manuscript and so have added a new section to the discussion where we compare our results here with those from budding yeast:

    Much of our understanding of the factors that influence gene dispensability comes from studies in model organisms, in particular the budding yeast Saccharomyces cerevisiae [3,9,10,43,44]. Analyses of the yeast gene deletion collection, a set of gene deletion mutants systematically generated in a single S. cerevisiae strain, revealed that paralogs were less likely to be essential than singleton genes [3,45]. Furthermore, more detailed analyses of yeast paralogs revealed that paralogs from large families were less likely to be essential as were genes with highly sequence similar paralogs [43,44]. Previous analyses, including our own, demonstrated that many of these trends are also evident when analyzing gene essentiality from CRISPR screens in cancer cell lines [12,13,15,35]. Our results here are also consistent with these findings – many of the features that are associated with paralog dispensability in yeast are also associated with gene deletion frequency in tumor genomes.

    The connection between the budding yeast observations and those in cancer is less clear when it comes to the relative dispensability of WGDs and SSDs. Analyses of the yeast gene deletion collection revealed that SSDs are more likely to be essential than WGDs in the single genetic background studied [43,44]. In our previous analyses of gene essentiality in hundreds of cancer cell lines we found that SSDs were more likely to be broadly essential (essential in most cell lines) than WGDs but that WGDs were less likely to be never essential (i.e. more likely to be essential in at least one cell line)[13]. As the analyses of gene essentiality in budding yeast were generated in a single genetic background the concordance with our cancer cell line results was difficult to assess, but as gene deletion collections are now being generated in additional yeast strains it should become possible to perform a more direct comparison[46–48].

    Here we found that WGDs are less likely to be deleted than SSDs in tumors. This is surprising in light of the yeast gene deletion collection results, where SSDs were more likely to be essential than WGDs in the strain studied, but less so in light of the cancer cell line results, where WGDs were less likely to be never essential. It is also worth noting that experimental evolution studies in yeast found that SSDs accumulate protein-altering mutations at a higher rate than WGDs [49,50]. These results are perhaps especially relevant when analyzing the influence of paralog features on selection in tumors.

    We note that there are many additional differences in the features of WGDs and SSDs between budding yeast and humans that may alter their relative dispensability in tumors. An obvious large-scale difference is that in the ancestor of humans there were two rounds of whole genome duplication compared to a single duplication event in yeast[51,52]. Less obvious, but potentially of importance for cancer, is that the two classes of paralogs are enriched in pathways in humans that do not have obvious counterparts in yeast. For example, WGDs are highly enriched in signaling pathways involved in development while SSDs are enriched in immune response genes[53]. How the membership of these pathways influences the dispensability and selection of genes in tumors and cancer cell lines warrants further study.

    Minor comments

    1. There is a missing reference on line 55.

    We thank the reviewer for catching this oversight. We have now added a reference to Zerbino et al, NAR 2018 on this line.

    CROSS-CONSULTATION COMMENTS
    That's a good suggestion by reviewer 1. Homozygous deletion collection is available in yeast so these data can be used directly in addition to the haploid gene deletion collection data.

    Since authors of this manuscript included in their analysis the comparison of WGD and SSD then they should do it more thoroughly. It is not sufficient what they presented here especially given that it contradicts the findings from model organisms.

    As noted above, we have now added a significant discussion of the yeast findings and also of the SSD/WGD observations.

    Reviewer #3 (Significance):

    This work provides the first systematic assessment of paralog dispensability specifically looking at homozygous deletions of paralogs across primary tumor samples and builds on the existing findings in cancer cell lines. It will be broadly interesting to those studying duplicated gene evolution and genome robustness. My expertise is in complex genetic networks in yeast and human cancer as well as genome evolution.

    We thank the reviewer for the positive assessment of our manuscript.

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #3

    Evidence, reproducibility and clarity

    Thank you for the opportunity to review "Paralog dispensability shapes homozygous deletion patterns in tumor genomes" by DeKegel et al. This manuscript uses TCGA and ICGC tumor data to show evidence for paralog dispensability. They analyze the rate of homozygous deletions and show that it is higher for paralogs compared to singletons. Their findings are robust to a number of confounding variables that they take into account, e.g. distance to tumor suppressor, telomere, centromere or fragile site. They show that paralogs that belong to large families and have higher sequence identity tend to show more dispensability and these dispensable paralogs are less likely to be WGD.

    Major comments.

    1. Does the finding pertaining to the lack of enrichment of paralogs in regions of LOH take into account whether the LOH is copy neutral or not, i.e. how does dosage affect this finding? Is it possible that there is a difference in paralog rate in LOH that results in a total copy number of 1 and that the presence of copy neutral LOH masks the effect? Also, integration of a gene expression dataset would be helpful to distinguish paralogs that compensate for the lack of their sister by upregulating their gene expression.
    2. Is the finding that paralogs are depleted among WGD influenced by the age of the WGD, since there are 2 WGD events? Do SSD tend to be more or less similar by sequence than WGD? This should be explored further since this observation is the opposite of what is observed in model organisms such as yeast, whereby SSD are less functionally similar than WGD and often show properties more similar to singletons than to WGD.

    Minor comments

    1. There is a missing reference on line 55.

    Referees cross-commenting

    That's a good suggestion by reviewer 1. Homozygous deletion collection is available in yeast so these data can be used directly in addition to the haploid gene deletion collection data.

    Since authors of this manuscript included in their analysis the comparison of WGD and SSD then they should do it more thoroughly. It is not sufficient what they presented here especially given that it contradicts the findings from model organisms.

    Significance

    This work provides the first systematic assessment of paralog dispensability specifically looking at homozygous deletions of paralogs across primary tumor samples and builds on the existing findings in cancer cell lines. It will be broadly interesting to those studying duplicated gene evolution and genome robustness. My expertise is in complex genetic networks in yeast and human cancer as well as genome evolution.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    Summary

    Here, De Kegel & Ryan analyse thousands of tumour samples from the TCGA and ICGC projects to identify homozygously deleted genes, finding that about 40% of protein-coding genes are deleted in at least one sample. They find homozygously deleted genes to be enriched for paralogous genes, and further, more frequently deleted genes are increasingly likely to be paralogs. The authors then test the influence of several factors on the likelihood of being deleted, including gene length, distance to a fragile site or chromosomal region, and distance to a recurrently deleted tumour suppressor gene (TSG). They find that proximity of a TSG, telomere, centromere, and fragile site all increase likelihood of being deleted in a sample, as does gene length. Having a paralog also remains an important predictor of deletion after accounting for these other factors. Additionally, the more similar in sequence the closest paralog is to the gene and having a larger gene family size are also predictive of deletion. Conversely, if a gene is a whole genome duplicate as opposed to a small-scale duplicate, it is less likely to be deleted. Finally, the authors test the hypothesis that paralogs that are deleted in cancer are less likely to be essential and find that this is indeed the case.

    Comments

    The authors have done a good job of identifying trends of paralog deletion in cancer samples and the factors influencing them. The results are well described and presented and support the conclusions. I like the inclusion of the saturation analysis as an estimate of what to expect given current and potential future sample sizes, and I appreciate the inclusion of a WGD/SSD paralog distinction. The data and methods are sufficiently detailed. I have a few minor comments below.

    1. Around line 160 in the text and supplemental figure 4A, the authors test if the trends they see are observed across individual cancer types. With 9 of 33 cancer types reaching a sample size threshold, 8 of 9 comparisons are significant. The authors do not state correcting for multiple testing.
    2. I initially misunderstood the hemizygous deletion analysis, thinking the analysis in supplemental figure 4B/C was asking if a sample has any singleton or any paralog deleted and comparing the number of samples with any deletion of either; given the number of genes deleted per sample, this wouldn't make sense as a good test. I think the authors are actually comparing the number of loss-of-heterozygosity (LOH) events per gene and grouping by paralog/singleton. I think this is a good analysis, but it would be helpful to clarify this in the text and figure legend, e.g. "Samples w/ gene LOH" could be "LOH events per gene" or something similar.
    3. Occasionally, I wanted some more detail in the text for context, which was sometimes later provided - e.g. I noted when reading about line 125 that I was curious at this point how often TSGs occurred on segments, and this was later provided on line 241. Similarly, around line 114 I was curious how many genes are typically deleted per HD segment, for which the median value was provided on line 206 (and distribution in supplemental figure 1), and again for hemizygous deletions. I think sometimes it would be helpful to provide this context earlier in the text to aid interpretation of the results.
    4. In the discussion, on line 420, the authors include the point that a paralog might not be required at all in a tumour cell and therefore easily deleted. I think this possibility could be expanded on here and in the introduction/results section, as it is an important point. I think it would be helpful to include more about the possibility that a paralog might be deleted in a tumour cell because it is simply not required or that is more likely to have less of a phenotypic impact compared to a singleton, and that this could be a reason for the observed enrichment of paralogs in deleted genes. A citation to support this point could be Áine N O'Toole, Laurence D Hurst, Aoife McLysaght, Faster Evolving Primate Genes Are More Likely to Duplicate, Molecular Biology and Evolution, Volume 35, Issue 1, January 2018, Pages 107-118, https://doi.org/10.1093/molbev/msx270. Duplicate genes can be duplicates because copy number variation of them has minimal impact.
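
    Comment 1 in the list above concerns multiple testing across the nine per-cancer-type comparisons. One common way to handle this is the Benjamini-Hochberg procedure, sketched below. This is an illustration only: the report does not say which correction the authors should use, the function name is hypothetical, and any p-values passed to it in an example would be placeholders rather than values from the manuscript.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay
    # monotone: each is the minimum of p * m / rank over ranks >= its own.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

    A comparison would then be reported as significant if its adjusted p-value falls below the chosen false discovery rate, for example 0.05.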

    Referees cross-commenting

    I agree, that's a helpful suggestion from reviewer 1.

    Reviewer 3's suggestion regarding age of the two whole genome duplication events is quite difficult to unpick as the duplication events seem to have happened relatively close in time to each other while rediploidisation of the first was occurring. Additionally, paralogs from SSDs tend to be more similar in sequence simply because the two WGD events are relatively old while SSDs can occur at any time up to present. They're therefore biased by young duplicates that have not had the opportunity to diverge much and decrease in sequence similarity.

    Significance

    This is a novel study as it examines the frequency of paralog deletion in cancer samples and the factors influencing it, building upon work already conducted in cancer cell lines. This study extends the knowledge of the field, confirming previous trends observed in cell lines, this time in actual cancer samples. It confirms that paralogs are more dispensable than singletons, likely because they have a similar counterpart that can provide some level of functional redundancy. The finding that the more similar the closest paralog is, the more likely a gene is to be deleted, provides support for this.

    It is certainly limited by the number of samples currently available in the two cancer sample projects included but the authors attempt to quantify how limiting this sample size is by conducting a saturation analysis using down-sampling to estimate how many gene deletions one can expect from different numbers of samples. This is important as the lack of observance of many gene deletions is likely due to the limited sample size and not due to negative selection. This low observance of gene deletions disappointingly limits further testing beyond single paralogs to consider the deletion effects of multiple gene family members and more directly test evidence of functional redundancy between paralogs. The authors provide a good discussion of the limitations of their study.
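
    The down-sampling idea behind a saturation analysis can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the function name, the input layout (a mapping from sample identifier to the set of genes homozygously deleted in that sample), the number of repeats, and the use of a simple mean are all choices made for this example.

```python
import random


def saturation_curve(sample_to_deleted_genes, sizes, n_repeats=20, seed=0):
    """Mean number of distinct deleted genes seen at each subsample size.

    sample_to_deleted_genes: dict mapping sample id -> set of deleted genes
    sizes: iterable of subsample sizes (each <= number of samples)
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    samples = list(sample_to_deleted_genes)
    curve = {}
    for size in sizes:
        counts = []
        for _ in range(n_repeats):
            # Draw samples without replacement and pool their deleted genes.
            subset = rng.sample(samples, size)
            observed = set()
            for sample_id in subset:
                observed.update(sample_to_deleted_genes[sample_id])
            counts.append(len(observed))
        curve[size] = sum(counts) / n_repeats
    return curve
```

    If the curve is still rising at the full cohort size, additional samples would be expected to reveal further deleted genes, which is the point the referee makes about limited sample size.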

    The results are of interest to evolutionary biologists and cancer biologists. Those with an interest in duplicate genes, and/or factors affecting gene loss in tumours will be interested in this work.

    My field of expertise is molecular evolution, gene duplication and copy number variation.

  4. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #1

    Evidence, reproducibility and clarity

    Summary:

    In this study, the authors delineate the association of paralog dispensability with the frequency of homozygous deletions (HDs) and thereby show that paralog dispensability can play a significant role in shaping tumor genomes. The authors analyzed the strength of negative selection on the paralogs relative to the singletons using frequencies of the homozygous deletions (HD). The study focused on HDs because they ensure a complete loss of function, unlike other mutational aberrations that can be masked because of haplo-sufficiency. While accounting for potential confounding factors, the authors find that paralogs tend to have a relatively high frequency of HDs, suggesting a relaxed negative selection. Furthermore, the authors specifically attribute this association to the dispensable paralogs by analyzing gene inactivation data generated from multiple experimental systems. Overall, the findings of this study can potentially have significant implications in the cancer biology field and specifically for researchers studying cancer genome evolution.

    Major comments:

    1. To dissect further which dispensable paralogs are more likely to be associated with a high HD frequency, synthetic lethal paralogs could be compared with non-synthetic lethal ones.
      In the section titled 'Homozygous deletion frequency of paralog passengers is influenced by paralog properties' (begins from line #289), authors have shown that paralogs with a high frequency of HDs are more likely to have the properties of dispensability (in Figure 4). It seems that all of those properties are also associated with synthetic lethality as the authors identified in their previous study (DeKegel et al. 2021). Furthermore, as shown in the subsequent section ('Essential paralogs are less frequently homozygously deleted than non-essential paralogs', begins from line #344), the high HD is associated with the dispensable paralogs. Some of those dispensable paralogs are expected to be synthetic lethal. Therefore, the association of paralogs with a high frequency of HDs with experimentally validated or predicted sets of synthetic lethal paralogs could be tested. This may help authors to contextualize their findings in terms of genetic interactions between paralogs.

    Minor comments:

    1. The number of TCGA and ICGC tumor samples analyzed:
      As mentioned in the Results section (line #106), 9966 tumor samples were analyzed. However, the sample size mentioned in Figure 2A is 9951. Is the lower number shown in the figure due to the filtering procedure mentioned in the Methods section (line #455)? The change in sample sizes could be explained. A similar difference in sample sizes exists for the ICGC data also.
    2. The rationale behind setting the threshold at 100 HD genes to classify 'hyper-deleted' samples for TCGA (line #462) and ICGC data (line #473) could be explained.
    3. Citation for DepMap is missing (caption of Figure 5).

    Referees cross-commenting

    Along the lines of Reviewer #3's second major comment, I have a suggestion, the possible benefits of which would depend on the target audience to which the authors intend to communicate their study.

    I would suggest including a brief comparison of the findings of this study which deal with human paralogs, with the findings in model organisms such as yeast, perhaps in the discussion section. To facilitate such a comparison, authors could try measuring the enrichments of, for example, molecular functions, gene families, types of genetic interactions, etc., among the paralogs associated with a high frequency of HDs and then discussing their comparison with what is known in the literature for paralogs in other model organisms that tend to be frequently deleted.

    Such a comparison could be of interest to the community of researchers working on other model organisms and put this study in a much broader context. However, as I said before, this would depend on the authors' intended target audience.

    Significance

    As the authors note in their manuscript, it is expected that paralog dispensability could be associated with the relaxed negative selection in tumor genomes because (1) paralogs are prevalent in the human genome, and (2) many of them are dispensable, as apparent from the large-scale gene inactivation screens in hundreds of cancer cell lines (Blomen et al. 2015, Wang et al. 2015, Dandage and Landry 2019, De Kegel and Ryan 2019). However, direct mapping of this association, while importantly accounting for potential confounding factors, has been lacking.
    As a researcher with prior experience in research topics such as gene duplication and genetic interactions, it appears to me that this study presents formal proof of the important association between paralog dispensability and tumor genome evolution, which could have major implications for the cancer biology research community and specifically for researchers working on topics such as cancer evolution, copy number alterations in cancer genomes, and synthetic lethality-based precision oncology therapeutics.