Landscape of oncoviral genotype and co-infection via human papilloma and hepatitis B viral tumor in situ profiling

Abstract

No abstract available

Article activity feed

  1. ##Author Response

    ###Reviewer #1:

    This study is an in silico analysis of data from the Cancer Genome Atlas (TCGA) on hepatitis B virus (HBV)-positive liver tumours and human papillomavirus (HPV)-positive cervical and head and neck tumours, and their association with viral load, genotype(s) and expression. The rationale for including two unrelated DNA tumour viruses in the study is unclear to me, especially as the number of HBV-positive samples is much lower than for HPV. Overall, the manuscript seems to be a validation of a bioinformatic tool rather than a report of significant research findings.

    We strongly believe that a global summary of key oncoviral-associated tumors makes sense in this context precisely because of the fundamental importance viral genotype is already known to have. While HBV and HPV are of course quite different viruses, there is extensive clinical evidence that linking outcomes to specific viral genotypes and phenotypes is of great value, which we expand upon in our work via a working demonstration of ViralMine. For this reason, we think it is crucial to present both virally related cohorts together: they support each other, demonstrate the robustness of our methods across completely different systems while allaying concerns about fine-tuning, and create a cohesive picture of the effect of viral genotype across the molecular landscape of two key oncoviruses. As the reviewer notes, this implicitly demonstrates the utility of ViralMine, but we emphasize that it also uncovers significant research findings.

    Concerning the HBV/HPV sample sizes, in fact the number and percentage of infected HCC samples is substantially higher than that of cervical or head and neck HPV samples as discussed in detail on page 4 of our manuscript.

    Use of the TCGA has allowed analysis of a reasonably large number of RNASeq data sets. However, once the authors drill down to individual genotypes, numbers become quite small, which may compromise some of the observations. For example, the large discrepancy between numbers of HPV16 (173) and HPV18 (39)-positive cases makes it difficult to make firm conclusions about the significance of differentially expressed cellular genes for each set of cancers. Similarly, in Figures 4 and 6 they compare HPV18 (23 cases) with HPV45 (39 cases) and HPV18/45 coinfections (number not stated but likely far fewer).

    While there is an imbalance in group size between HPV genotypes in the cervical cancer cohort, the test statistic used by the DESeq2 pipeline to identify differentially expressed genes does account for class imbalance, and even in the most extreme case we have analyzed, the dispersion parameter estimates are easily verified as accurate. Accurately inferring group-wise dispersion parameters given unequal group sizes is a well-known problem, but it only becomes acute when one group becomes so small (~1 sample) that it is difficult to estimate its common dispersion parameter. That situation clearly does not arise here. Additionally, it should be noted that in Figure 4b we are comparing ALL HPV co-infected cervical tumor samples (92 cases) against single-infection samples (193 cases), a comparison in which the reviewer may have more confidence and which is statistically reasonable. Furthermore, while the comparison of cervical cancer HPV18 (n=10), HPV45 (n=9), and HPV18/45 coinfected (n=39) cases in Figure 6b does involve relatively small patient groups, the significant difference in neoantigen population TCR binding affinity is confirmed by a one-sided, non-parametric KS test and shown to be robust to subsampling, which formally demonstrates that the signal is not artefactual. Therefore, from a statistical point of view, the concerns raised about class imbalance and power are not fundamental and were addressed in the original manuscript draft. Thus, we believe we can completely address the reviewer’s concerns by:

    Indicating the group sizes (n=X) compared in the barcode plots in Figure 3a and Figure 4a and b to improve transparency in the contrasts, and adding group numbers to Figure 6a and b. Further, we will include a new supplementary figure demonstrating that the difference in TCR binding affinity distributions remains robust under bootstrap resampling of the HPV group neoantigens to balance group sizes.
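
    As an illustration of this kind of group-size-balanced check, a minimal sketch is given here. It is not the analysis code behind the manuscript: the function names, the number of resamples, and the synthetic values are all assumptions made for the example. It computes the one-sided two-sample KS statistic and recomputes it on bootstrap resamples drawn to the size of the smaller group.

```python
import random
from bisect import bisect_right
from statistics import median


def ks_one_sided(a, b):
    """Return D+ = max over x of [ECDF_a(x) - ECDF_b(x)].

    D+ is large when values in `a` tend to be smaller than values in `b`.
    """
    sa, sb = sorted(a), sorted(b)
    # The maximum is attained at an observed data point, so checking those suffices.
    return max(
        bisect_right(sa, x) / len(sa) - bisect_right(sb, x) / len(sb)
        for x in sa + sb
    )


def balanced_bootstrap_ks(a, b, n_boot=1000, seed=0):
    """Resample both groups with replacement to the size of the smaller
    group and return the D+ statistic from every resample."""
    rng = random.Random(seed)
    k = min(len(a), len(b))
    return [
        ks_one_sided(rng.choices(a, k=k), rng.choices(b, k=k))
        for _ in range(n_boot)
    ]


# Synthetic stand-in data (hypothetical, NOT values from the manuscript).
rng = random.Random(1)
large_group = [rng.gauss(0.0, 1.0) for _ in range(200)]
small_group = [rng.gauss(1.5, 1.0) for _ in range(30)]

boot = balanced_bootstrap_ks(large_group, small_group, n_boot=500)
print(f"full-data D+ = {ks_one_sided(large_group, small_group):.3f}")
print(f"median balanced-bootstrap D+ = {median(boot):.3f}")
```

    If the resampled statistics stay close to the full-data statistic, the difference between the distributions is unlikely to be driven by the unequal group sizes alone.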

    Much of the information that they derive from their analyses is not novel. For example, they report no preferential sites of HPV integration. Despite what they claim, quite a bit is known about HPV co-infection in cervical cancers and it is not uncommon but varies according to geographical regions, which was not a variable they used.

    We acknowledge that other oncoviral survey papers have provided evidence of preferential integration (as we originally cited, as well as referenced in Dall et al. (2008), Zhang et al. (2016)). However, these and other previous characterizations of recurrent HPV integration do not attempt to organize these sites by either genotype or co-infection status, which was our explicit and stated aim, principally because they could not efficiently and accurately determine these parameters from in-situ tumor RNA. As we found no preference in integration along these axes of variation (which we acknowledged openly in the manuscript as being expected when using RNA rather than DNA), we deliberately chose not to present these results as a main finding and included them in supplemental results for the sake of completeness.

    We also agree that HPV co-infection in cervical lesions is not per se a novel finding, although to be clear most literature focuses on side-by-side infections of HPV with another virus (HHV, EBV, HIV, etc.), or uses the term to describe groupings of sub-variants or isolates under the same viral genotype header (Mirabello et al. (2016)). Additionally, most of the literature focuses on HPV co-infection in cervical neoplasia or high-grade lesions and cervical cancer risk (Chaturvedi et al. (2011); Senapati et al. (2017)) rather than assessing HPV co-infection in the tumoral tissue itself, post oncogenesis. As such, we believe that our approach of examining in situ cervical tumor infections, and the relatively high rate of HPV co-infections we observe, does merit particular notice compared with previous studies. Furthermore, the analyses linking this cross-genotype co-infection phenotype with tumor gene expression, survival adjusted for major known clinical covariates, and tumor immunogenicity measures have not been reported elsewhere to our knowledge.

    For HPV, viral exon-level RNASeq analysis is irrelevant because HPV gene expression is polycistronic and is subject to changes by random viral integration events in individual cases. Therefore, it is unlikely that general overall viral gene expression signatures will be diagnostic. Besides, from multiple studies we understand that what matters in cervical cancer is the level of expression of the E6/E6 isoforms/E7 oncogenes.

    We agree that the post-transcriptional polycistronic nature of HPV expression makes it difficult to elucidate the effect of differing HPV gene-level expression on ultimate HPV gene translation and protein expression. However, our related yet distinct question here concerns the effect that HPV genotype and cancer type have on HPV gene transcriptional differences (as seen in Figure 7), so we believe we are within the limits of reasonable interpretation. Additionally, while E6 and E7 expression are well known to drive oncogenesis, it seems crucial to quantify the expression of these viral oncogenes across viral genotype and tissue type, which has not been done previously to our knowledge. Finally, even if we were to accept that the average tumoral viral gene exon expression itself is best described as a random variable, which we do not, it would remain to be explained why we observe and report persistent genotype-specific expression patterns across completely different cell types.

    The references chosen for the HPV part of the study are either rather out of date or not representative of the extensive literature.

    We acknowledge that we have cited only a portion of the vast HPV-related cancer literature, so we have made an effort to include more recent surveys and studies as references.

    ###Reviewer #2:

    1. The authors comment that averaged infection phenotypes such as viral load or predominant genotype may be replaced by more granular measures, such as exon-level viral expression or the ratio of expressed viral genotypes. In reality, viral expression, and the ratio of expressed viral genotypes, are still 'tumor averages' in the way that the authors have analysed them. HPV-associated tumors are heterogeneous, and without in situ analysis, it is hard to discern which transcripts are involved in driving the cancer phenotype, and which are found in associated precancerous tissue.

    We concede that the viral genotypes quantified by our method represent a computed average measure across the tumor, as would any measurement of any quantity in a bulk sequencing assay. However, the information provided by the admixture of genotypes and exon-level viral expression does provide an additional measure of granularity over previous bulk measures, and allows additional analyses not explored prior to our work. To make a comparison, this criticism could apply equally to cell-type decomposition algorithms like Cibersort, which despite their problems and inherent limitations do provide insightful information. We agree with the reviewer that more targeted in situ analyses would allow for a truly specific association of particular viral transcripts with tumor phenotype, and would serve as a useful validation of some of our results, but this certainly does not invalidate the tumor-aggregated genotype and co-infection presence associations we present here. We also agree with the reviewer that multiple biopsies would allow intra-tumoral heterogeneity to be taken into account in our study; however, no major public resources (e.g. TCGA) include such data, and we believe that such an undertaking lies outside any reasonable scope of this work.

    2. The authors use the term co-infection quite widely. For HPV, previous studies have shown that coinfection within cells in an individual cancer or neoplasia is rare, although independent infections by different HPV types can occur side-by-side. I expect something similar with HBV, although the study would need a higher level of analysis to establish this. The use of terminology, and the way in which data is interpreted, needs to be much more rigorous.

    We agree with the reviewer that the use of ‘co-infection’ in this context is unclear, as co-infection on a cellular level with two different HPV/HBV genotypes is impossible to determine by bulk RNA sequencing analysis. We will clarify ‘co-infection’ as strictly a mixture of independent HPV infections contained in the same tumor tissue.

    We will clearly define our meaning of ‘co-infection’ in the introduction as the aggregated mixture of HPV genotypes expressed in the tumor tissue (‘side-by-side’ infections), to remove ambiguity as to our cohort characterization.
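
    To make this bulk-level definition concrete, a minimal sketch is given here. It is purely illustrative and is not the ViralMine procedure: the function names, the 5% threshold, and the read counts are hypothetical. It summarizes a sample by the fraction of viral reads assigned to each genotype and flags it as a mixture when more than one genotype is appreciably expressed.

```python
def genotype_fractions(read_counts):
    """Convert per-genotype read counts into fractions of all viral reads."""
    total = sum(read_counts.values())
    if total == 0:
        return {genotype: 0.0 for genotype in read_counts}
    return {genotype: n / total for genotype, n in read_counts.items()}


def is_coinfected(read_counts, min_fraction=0.05):
    """Flag a sample as a genotype mixture when two or more genotypes each
    contribute at least `min_fraction` of viral reads.

    The threshold is a hypothetical choice made for this example.
    """
    fractions = genotype_fractions(read_counts)
    return sum(f >= min_fraction for f in fractions.values()) >= 2


sample = {"HPV16": 900, "HPV18": 100}  # made-up counts for illustration
print(genotype_fractions(sample), is_coinfected(sample))
```

    A flag of this kind describes the tumor-level mixture only; it says nothing about whether individual cells carry more than one genotype.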

    3. Viral load is generally used in the field as a measure of viral genome or genome-fragment abundance. This is already a misuse of the terminology, as the term implies virus numbers, or even infectious virus numbers. Here the term is used to refer to viral transcript abundance. The authors need to say precisely what they're measuring, and need to be aware that they are measuring the average across a heterogeneous tumour, which may have areas of high grade neoplasia, cancer, and even low-grade neoplasia. My feeling is that the level of analysis is too great, given the uncertainties regarding the heterogeneous nature of tissue that is being analysed, and the different cells with different levels of viral gene expression that are most likely present.

    We agree that, as the reviewer frames it, our use of ‘viral load’ should be clarified as ‘viral transcript abundance’ as determined from the tumor RNASeq data in variance-stabilized units of log2 counts per million reads mapped across the viral contig. We do note, however, that it has been previously indicated that levels of viral transcripts correlate well with virus numbers in infected tissue. Concerning the last comment of the reviewer, we wish to point out that our analysis goes no further, either in analytic complexity or in drawing inference from expression data, than any other published study based on tumor bulk RNA-sequencing data. All samples will contain a mixture of cells, and we emphasize that we are only measuring average signals, viral or host tumor specific, across this mixture.

    To address these comments, we will change all references to viral load to normalized viral transcript abundance, to remove ambiguity. We will also once again emphasize that our conclusions hold only in a strict averaged sense.
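
    For readers unfamiliar with the unit, a minimal sketch of a log2 counts-per-million calculation is given here. It is an assumption-laden illustration: the function name and the pseudocount are hypothetical, and the variance-stabilizing step mentioned above is not reproduced.

```python
import math


def log2_cpm(viral_reads, total_mapped_reads, pseudocount=0.5):
    """Return log2 of viral reads per million mapped reads.

    `pseudocount` (a hypothetical default) keeps the logarithm finite for
    samples with zero viral reads; it must be > 0 if such samples can occur.
    """
    if total_mapped_reads <= 0:
        raise ValueError("total_mapped_reads must be positive")
    cpm = (viral_reads + pseudocount) * 1e6 / total_mapped_reads
    return math.log2(cpm)


# Made-up read counts for illustration only.
print(log2_cpm(viral_reads=12_000, total_mapped_reads=40_000_000))
```

    Because the value is a ratio to all mapped reads in a bulk sample, it is an average over every cell in the extract, which is the sense in which the abundance should be read.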

    4. Several of the figures don't obviously support the conclusions. For instance, it is not clear how the data shown in figure S2 supports the title of the S2 figure legend. Surely some statistical analysis is needed to support the conclusion stated in the legend. Given previous studies, I'm not at all convinced that the distribution of causative HPV genotypes is the same between SCC and Adenocarcinoma. An additional limitation of these large cancer association studies comes from limitations in pathology diagnosis, which cannot always accurately distinguish borderline SCC/adenocarcinoma cases. With the large-scale transcriptional analysis, maybe the authors can use molecular information available in their samples to look at this.

    As the reviewer points out, we agree the statistical evidence backing our claim of no association between cervical histology and HPV infection genotype or co-infection should be added. This calculation was actually carried out and only reported in the text, but we will amend the figure to include the results and apologize for this key omission. We also note in passing that we are not making any claims about ‘causative’ HPV genotypes for the respective subtypes, but rather much more conservative statements about association. Concerning the reviewer’s concern about the quality of the phenotypic data reported in the TCGA, we heartily agree but are unable to really do much else. Indeed, concerning the last interesting comment about utilizing molecular information in our samples to distinguish SCC/adenocarcinoma subtypes, we did not find reliable gene expression signatures which could be used to validate or correct the phenotypic results.

    We will add the Spearman correlation coefficient (rho) and the significance test results for the correlation between cervical cancer histological type and both viral phenotypes represented in figure S2.
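
    A minimal sketch of such a rank-correlation check is given here, assuming histological type and the viral phenotype are each coded numerically per sample. The coding, the sample values, and the use of a permutation test for significance are assumptions made for the example, not a description of the calculation reported in the manuscript.

```python
import math
import random


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the tie-averaged ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def permutation_pvalue(x, y, n_perm=2000, seed=0):
    """Two-sided p-value: how often a shuffled y gives |rho| >= the observed |rho|."""
    rng = random.Random(seed)
    observed = abs(spearman_rho(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(spearman_rho(x, shuffled)) >= observed - 1e-12:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Hypothetical coding: histology 0 = SCC, 1 = adenocarcinoma; co-infection 0/1.
histology = [0, 0, 0, 1, 1, 0, 1, 0, 1, 0]
coinfected = [1, 0, 0, 1, 0, 1, 0, 0, 1, 1]
print(spearman_rho(histology, coinfected), permutation_pvalue(histology, coinfected))
```

    With two binary variables many ranks are tied, which is why the sketch averages tied ranks rather than relying on the shortcut formula that assumes distinct values.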

    5. The APOBEC analysis is quite rudimentary in the text, and does not discuss the different members of the APOBEC family. Similarly, the different effects of single and multiple HPV infections on the IRF3 responsive genes is poorly developed at the biological level, which most probably reflects the general way in which the utility of the approach.

    We agree with the reviewer that our APOBEC expression analysis in the HPV+ cervical cohort could be more comprehensive, and that the interpretations of the results may therefore be too far-reaching. We believed the initial result to be of sufficient interest in the context of a very similar result from Zapatka et al. (2020), but concede it may make more sense as a supplemental result alone, without additional evaluation or discussion of the greater APOBEC family. Additionally, the pathway analysis involving the differentially expressed genes from the co-infected and non-coinfected cervical tumors should most likely be moved to a supplemental result as well, without further analyses to support the enrichment trends, following how we reported the HBV-associated liver cancer co-infection DEG results (figure S5).

    We will move Figure 3d to a supplemental figure, limit our comments in the results to an observation in reference to Zapatka et al., and delete any associated interpretation. We will move Figure 3c to a new supplemental figure as well, and remove the suggestion of expanded antiviral activation in co-infected tumors.

  2. ###Reviewer #2:

    The title of the manuscript suggests a detailed analysis of cancers using in situ gene expression approaches, which aims to provide new insight into tumour heterogeneity and co-infection. The manuscript is in fact an analysis of viral transcription and the presence of cellular mutations in a collection of tumours associated with HPV and HBV infection. Much of the starting data for the analysis has been drawn from the TCGA database. It is a little unclear as to whether the authors are pitching this paper as a methodological development manuscript, but I think that this is what it is at its heart. The ability to deconvolute RNA sequencing data from virus-associated tumours is interesting, and could be widely used as a research tool. However, much of the manuscript is concerned with interpreting the data, and I think the interpretation goes well beyond what can feasibly be achieved from the analysis of transcripts in extracts of total tumour tissue. The authors' term 'co-infection' most likely refers to heterogeneous mixtures of viral infected cells which are competing with each other in the tumour. In my view, the biological interpretations are not particularly useful at the level that they are presented, but could serve as the starting point for future research. This manuscript could be repackaged as a description of a new analytical tool, or the most exciting aspects drawn out with the addition of biological studies to explain what the transcriptional analysis may mean. This would be a complex process, and would be facilitated by focus on either HPV or HBV, as trying to extend conclusions to the two disparate virus families in one manuscript is probably unrealistic. Without any analysis of tumour tissue using in situ analysis or single cell sequence analysis, or a combination of the two, there is little new information that can be drawn regarding the biology of disease development.

    My suggestion would be to repackage this as an analytical methodology publication, rather than a biology discovery manuscript.

    1. The authors comment that averaged infection phenotypes such as viral load or predominant genotype may be replaced by more granular measures, such as exon-level viral expression or the ratio of expressed viral genotypes. In reality, viral expression, and the ratio of expressed viral genotypes, are still 'tumor averages' in the way that the authors have analysed them. HPV-associated tumors are heterogeneous, and without in situ analysis, it is hard to discern which transcripts are involved in driving the cancer phenotype, and which are found in associated precancerous tissue.

    2. The authors use the term co-infection quite widely. For HPV, previous studies have shown that coinfection within cells in an individual cancer or neoplasia is rare, although independent infections by different HPV types can occur side-by-side. I expect something similar with HBV, although the study would need a higher level of analysis to establish this. The use of terminology, and the way in which data is interpreted, needs to be much more rigorous.

    3. Viral load is generally used in the field as a measure of viral genome or genome-fragment abundance. This is already a misuse of the terminology, as the term implies virus numbers, or even infectious virus numbers. Here the term is used to refer to viral transcript abundance. The authors need to say precisely what they're measuring, and need to be aware that they are measuring the average across a heterogeneous tumour, which may have areas of high grade neoplasia, cancer, and even low-grade neoplasia. My feeling is that the level of analysis is too great, given the uncertainties regarding the heterogeneous nature of tissue that is being analysed, and the different cells with different levels of viral gene expression that are most likely present.

    4. Several of the figures don't obviously support the conclusions. For instance, it is not clear how the data shown in figure S2 supports the title of the S2 figure legend. Surely some statistical analysis is needed to support the conclusion stated in the legend. Given previous studies, I'm not at all convinced that the distribution of causative HPV genotypes is the same between SCC and Adenocarcinoma. An additional limitation of these large cancer association studies comes from limitations in pathology diagnosis, which cannot always accurately distinguish borderline SCC/adenocarcinoma cases. With the large-scale transcriptional analysis, maybe the authors can use molecular information available in their samples to look at this.

    5. The APOBEC analysis is quite rudimentary in the text, and does not discuss the different members of the APOBEC family. Similarly, the different effects of single and multiple HPV infections on the IRF3 responsive genes is poorly developed at the biological level, which most probably reflects the general way in which the utility of the approach.

  3. ###Reviewer #1:

    This study is an in silico analysis of data from the Cancer Genome Atlas (TCGA) on hepatitis B virus (HBV)-positive liver tumours and human papillomavirus (HPV)-positive cervical and head and neck tumours, and their association with viral load, genotype(s) and expression. The rationale for including two unrelated DNA tumour viruses in the study is unclear to me, especially as the number of HBV-positive samples is much lower than for HPV. Overall, the manuscript seems to be a validation of a bioinformatic tool rather than a report of significant research findings.

    Use of the TCGA has allowed analysis of a reasonably large number of RNASeq data sets. However, once the authors drill down to individual genotypes, numbers become quite small, which may compromise some of the observations. For example, the large discrepancy between numbers of HPV16 (173) and HPV18 (39)-positive cases makes it difficult to make firm conclusions about the significance of differentially expressed cellular genes for each set of cancers. Similarly, in Figures 4 and 6 they compare HPV18 (23 cases) with HPV45 (39 cases) and HPV18/45 coinfections (number not stated but likely far fewer).

    Much of the information that they derive from their analyses is not novel. For example, they report no preferential sites of HPV integration. Despite what they claim, quite a bit is known about HPV co-infection in cervical cancers and it is not uncommon but varies according to geographical regions, which was not a variable they used.

    For HPV, viral exon-level RNASeq analysis is irrelevant because HPV gene expression is polycistronic and is subject to changes by random viral integration events in individual cases. Therefore, it is unlikely that general overall viral gene expression signatures will be diagnostic. Besides, from multiple studies we understand that what matters in cervical cancer is the level of expression of the E6/E6 isoforms/E7 oncogenes.

    However, such an in silico approach to quantify various aspects of virus-associated tumours could be a useful prognostic clinical tool in the future.

    The references chosen for the HPV part of the study are either rather out of date or not representative of the extensive literature.

  4. ##Preprint Review

    This preprint was reviewed using eLife’s Preprint Review service, which provides public peer reviews of manuscripts posted on bioRxiv for the benefit of the authors, readers, potential readers, and others interested in our assessment of the work. This review applies only to version 1 of the manuscript. Margaret Stanley (University of Cambridge) served as the Reviewing Editor.

    ###Summary:

    The reviewers agree that the study is technically impressive but the biological data generated is not particularly novel and there are criticisms of the interpretation of the data. The study may have value as a methodological and bioinformatics tool.