Interrogating the precancerous evolution of pathway dysfunction in lung squamous cell carcinoma using XTABLE

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    The authors have developed a useful and user-friendly software to analyse gene expression data from four datasets representing premalignant lung lesions. This software would be of interest to those working in lung cancer and specifically the pre-malignant space. The major strength is the ease of use while the major limitation is the inability for the user to integrate other datasets.

Abstract

Lung squamous cell carcinoma (LUSC) is a type of lung cancer with a dismal prognosis that lacks adequate therapies and actionable targets. This disease is characterized by a sequence of low- and high-grade preinvasive stages with increasing probability of malignant progression. Increasing our knowledge about the biology of these premalignant lesions (PMLs) is necessary to design new methods of early detection and prevention, and to identify the molecular processes that are key for malignant progression. To facilitate this research, we have designed XTABLE (Exploring Transcriptomes of Bronchial Lesions), an open-source application that integrates the most extensive transcriptomic databases of PMLs published so far. With this tool, users can stratify samples using multiple parameters and interrogate PML biology in multiple ways, such as two- and multiple-group comparisons, interrogation of genes of interest, and transcriptional signatures. Using XTABLE, we have carried out a comparative study of the potential role of chromosomal instability scores as biomarkers of PML progression and mapped the onset of the most relevant LUSC pathways to the sequence of LUSC developmental stages. XTABLE will critically facilitate new research for the identification of early detection biomarkers and help acquire a better understanding of the LUSC precancerous stages.

Article activity feed

  1. eLife assessment

    The authors have developed a useful and user-friendly software to analyse gene expression data from four datasets representing premalignant lung lesions. This software would be of interest to those working in lung cancer and specifically the pre-malignant space. The major strength is the ease of use while the major limitation is the inability for the user to integrate other datasets.

  2. Reviewer #1 (Public Review):

    Roberts et al have developed a tool called "XTABLE" for the analysis of publicly available transcriptomic datasets of premalignant lesions (PML) of lung squamous cell carcinoma (LUSC). Detection of PMLs has clinical implications and can aid in the prevention of deaths by LUSC. Hence efforts such as this will be of benefit to the scientific community in better understanding the biology of PMLs.

    The authors have curated four studies that have profiled the transcriptomes of PMLs at different stages. While three of them are microarray-based studies, one study has profiled the transcriptome with RNA-seq. XTABLE fetches these datasets and performs analysis in an R shiny app (a graphical user interface). The tool has multiple functionalities to cover a wide range of transcriptomic analyses, including differential expression, signature identification, and immune cell type deconvolution.

    The authors have also included three chromosomal instability (CIN) signatures from literature based on gene expression profiles. They showed one of the CIN signatures as a good predictor of progression. However, this signature performed well only in one study. The authors have further utilised the tool XTABLE to identify the signalling pathways in LUSC important for its developmental stages. They found the activation of squamous differentiation and PI3K/Akt pathways to play a role in the transition from low to high-grade PMLs.

    The authors have developed user-friendly software to analyse publicly available gene expression data from premalignant lesions of lung cancer. This would help researchers to quickly analyse the data and improve our understanding of such lesions. This would pave the way to improve early detection of PMLs to prevent lung cancer.

    Strengths:

    1. XTABLE is a nicely packaged application that can be used by researchers with very little computational knowledge.
    2. The tool is easy to download and execute. The documentation is extensive both in the article and on the GitLab page.
    3. The tool is user-friendly, and the tabs are intuitively designed for successive steps of analysis of the transcriptome data.
    4. The authors have properly elaborated on the biological interest in investigating PMLs and their clinical significance.

    Weaknesses:

    The article is focused on the development and the utility of the tool XTABLE. While the tool is nicely developed, the need for a tool focussing only on the investigation of PMLs is not justified. Several shiny apps and online tools exist to perform transcriptomic analysis of published datasets. To list a few examples - i) http://ge-lab.org/idep/ ; ii) http://www.uusmb.unam.mx/ideamex/ ; iii) RNfuzzyApp (Haering et al., 2021); iv) DEGenR (https://doi.org/10.5281/zenodo.4815134); v) TCC-GUI (Su et al., 2019). While some of these are specific to RNA-seq, there are plenty of such shiny apps to perform both RNA-seq and microarray data analysis. Any of these tools could also be used easily for the analysis of the four curated datasets presented in this article. The authors could have elaborated on the availability of other tools for such analysis and provided an explanation of the necessity of XTABLE. Since 3 of the 4 datasets they curated are from microarray technology, another good example of a user-friendly tool is NCBI GEO2R. This is integrated with the NCBI GEO database, and the user doesn't need to download the data or run any tools. iDEP-READS (http://bioinformatics.sdstate.edu/reads/) provides an online user-friendly tool to download and analyse data from publicly available datasets. Another such example is GEO2Enrichr (https://maayanlab.cloud/g2e/). These tools have been designed for non-bioinformatic researchers and don't involve downloading datasets or installing/running other tools.

    Secondly, XTABLE doesn't provide a solution to integrate the four datasets incorporated in the tool. One can only analyse one dataset at a time with XTABLE. The differences in terms of methodology and study design within these four datasets have been elaborated on in the article. However, attempts to integrate them were lacking.

    The tool also lacks the flexibility for users to add more datasets. This would be helpful when there are more datasets of PMLs available publicly.

    Understanding the biology of PML progression would require a multi-omics approach. XTABLE analyses transcriptome data and lacks integration of other omics data. The authors mention the availability of data from whole exome, methylation, etc from the four studies they have selected. However, apart from the CIN scores, they haven't integrated any of the other layers of omics data available.

    Lastly, the authors could have elaborated on the limitations of the tool and their analysis in the discussion.

  3. Reviewer #2 (Public Review):

    In this manuscript, Roberts et al. present XTABLE, a tool to integrate, visualise and extract new insights from published datasets in the field of preinvasive lung cancer lesions. This approach is critical and to be highly commended; whilst the Cancer Genome Atlas provided many insights into cancer biology it was the development of accessible visualisation tools such as cbioportal that democratised this knowledge and allowed researchers around the world to interrogate their genes and pathways of interest. XTABLE is trying to do this in the preinvasive space and should certainly be commended as such. We are also very impressed by the transparency of the approach; it is quite simple to download and run XTABLE from their Gitlab account, in which all data acquisition and analysis code can be easily interrogated.

    We would however strongly advocate deploying XTABLE to a web-accessible server so that researchers without experience in R and git can utilise it. We found it a little buggy running locally and cannot be sure whether this is due to our setup or the code itself. Some issues clearly need development; Progeny analysis brings up a warning "Not working for GSE109743 on the server and not sure why". GSEA analysis does not seem to work at all, raising an error "Length information for genome hg38 and gene ID ensGene is not available". In such relatively complex software, some such errors can be overlooked, as long as the authors have a clear process for responding to them, for example using GitLab issue reporting. Some acknowledgement that this is an ongoing development would be helpful.

    The authors discuss some very important differences between the datasets in the text. Most notably they differ in endpoints and in the presence of laser capture. We would advocate including some warning text within the XTABLE application to explain these. For example, the "persistent/progressive" endpoint used in Beane et al (next biopsy is the same or higher grade) is not the same as the "progressive" endpoint in Teixeira et al (next biopsy is cancer); samples defined as "persistent/progressive" may never progress to cancer. This may not be immediately obvious to a user of XTABLE who wishes to compare progressive and regressive lesions. Similarly, the use of laser capture is important; the authors state that not using laser capture has the advantage of capturing microenvironment signals, but differentiating between intra-lesional and stromal signals is important, as shown in the Mascaux and Pennycuick papers. The authors cannot do much about the different study designs, but as the goal is to make these data more accessible we think some brief description of these issues within the app would help to prevent non-expert users from drawing incorrect conclusions.

    The authors themselves illustrate this clearly in their analysis of CIN signatures in progression potential. They observe that there is a much clearer progressive/regressive signal in GSE108124 compared to GSE114489 and GSE109743. This does not seem at all surprising, since the first study used a much stricter definition of progression - these samples are all about to become cancer whereas "progressive" samples in GSE109743 may never become cancer - and are much enriched for CIN signals due to laser capture. Their discussion states "CIN scores as a predictor of progression might be limited to microdissected samples and CIS lesions"; you cannot really claim this when "progression" in the two cohorts has such a different meaning. To their credit, the authors do explain these issues but they really should be clearly spelled out within the app.

    We are not sure we agree with their analysis of CDK4/Cyclin-D1 and E2F expression in early lesions. The authors claim these are inhibited by CDKN2A and therefore are markers of CDKN2A loss of function. But these genes are markers of proliferation and can be driven by a range of proliferative processes. Histologically, low-grade metaplasias and dysplasias all represent proliferative epithelium when compared to normal control, but most never become cancer. It is too much of a leap to say that these are influenced by CDKN2A because that gene is inactivated in LUSC; do the authors have any evidence that this gene is altered at the genomic level in low-grade lesions?

    Overall this tool is an important step forwards in the field. Whilst we are a little unconvinced by some of their biological interpretations, and the tool itself has a few bugs, this effort to make complex data more accessible will be greatly enabling for researchers and so should be commended. In the future, we would like to see additional molecular data integrated into this app, for example, the whole genome and methylation data mentioned in line 153. However, we think this is an excellent start to combining these datasets.

  4. Author Response:

    Reviewer #1 (Public Review):

    Roberts et al have developed a tool called "XTABLE" for the analysis of publicly available transcriptomic datasets of premalignant lesions (PML) of lung squamous cell carcinoma (LUSC). Detection of PMLs has clinical implications and can aid in the prevention of deaths by LUSC. Hence efforts such as this will be of benefit to the scientific community in better understanding the biology of PMLs.

    The authors have curated four studies that have profiled the transcriptomes of PMLs at different stages. While three of them are microarray-based studies, one study has profiled the transcriptome with RNA-seq. XTABLE fetches these datasets and performs analysis in an R shiny app (a graphical user interface). The tool has multiple functionalities to cover a wide range of transcriptomic analyses, including differential expression, signature identification, and immune cell type deconvolution.

    The authors have also included three chromosomal instability (CIN) signatures from literature based on gene expression profiles. They showed one of the CIN signatures as a good predictor of progression. However, this signature performed well only in one study. The authors have further utilised the tool XTABLE to identify the signalling pathways in LUSC important for its developmental stages. They found the activation of squamous differentiation and PI3K/Akt pathways to play a role in the transition from low to high-grade PMLs.

    The authors have developed user-friendly software to analyse publicly available gene expression data from premalignant lesions of lung cancer. This would help researchers to quickly analyse the data and improve our understanding of such lesions. This would pave the way to improve early detection of PMLs to prevent lung cancer.

    Strengths:

    1. XTABLE is a nicely packaged application that can be used by researchers with very little computational knowledge.
    2. The tool is easy to download and execute. The documentation is extensive both in the article and on the GitLab page.
    3. The tool is user-friendly, and the tabs are intuitively designed for successive steps of analysis of the transcriptome data.
    4. The authors have properly elaborated on the biological interest in investigating PMLs and their clinical significance.

    Weaknesses:

    The article is focused on the development and the utility of the tool XTABLE. While the tool is nicely developed, the need for a tool focussing only on the investigation of PMLs is not justified. Several shiny apps and online tools exist to perform transcriptomic analysis of published datasets. To list a few examples - i) http://ge-lab.org/idep/ ; ii) http://www.uusmb.unam.mx/ideamex/ ; iii) RNfuzzyApp (Haering et al., 2021); iv) DEGenR (https://doi.org/10.5281/zenodo.4815134); v) TCC-GUI (Su et al., 2019). While some of these are specific to RNA-seq, there are plenty of such shiny apps to perform both RNA-seq and microarray data analysis. Any of these tools could also be used easily for the analysis of the four curated datasets presented in this article. The authors could have elaborated on the availability of other tools for such analysis and provided an explanation of the necessity of XTABLE. Since 3 of the 4 datasets they curated are from microarray technology, another good example of a user-friendly tool is NCBI GEO2R. This is integrated with the NCBI GEO database, and the user doesn't need to download the data or run any tools. iDEP-READS (http://bioinformatics.sdstate.edu/reads/) provides an online user-friendly tool to download and analyse data from publicly available datasets. Another such example is GEO2Enrichr (https://maayanlab.cloud/g2e/). These tools have been designed for non-bioinformatic researchers and don't involve downloading datasets or installing/running other tools.

    Two of these tools (IDEP and TCC-GUI) were covered in a literature review of 20 Shiny apps performed two years ago, before work on XTABLE started. Three of the suggested tools (IDEP, RNFuzzyApp, TCC-GUI) process only RNA-seq datasets. IDEAMEX appears to be for RNA-seq data only and is severely limited in its downstream analysis capabilities. DEGenR appears to handle microarray datasets and features an option to retrieve data directly from GEO. However, it appears to be based on GEO2R (with additional downstream analyses): it automatically log-transforms already log-transformed data and, unlike GEO2R, offers no option to skip the log-transformation. A refreshed literature search focusing on microarray datasets highlighted three additional tools: iGEAK, which hasn’t been updated in three years and seems to have compatibility issues on new Windows and Mac machines; sMAP, an upcoming Shiny app for microarray data posted on bioRxiv on 29 May 2022; and MAAP, which has the same issue of log-transforming already log-transformed data. iDEP-READS does not list the datasets used in XTABLE. GEO2Enrichr appears to require the counts table and experimental design in one file, performs a “characteristic direction” DEG test and outputs enriched pathways. These apps require not just downloading the datasets but also reformatting and renaming expression data files and creating additional files to set up the DEG analysis, which is not practical for the number of samples we have (122, 63, 33, 448), even if these apps handled microarray data. XTABLE also incorporates AUC metrics, which are appropriate given the number of samples in each dataset, and tools known for adequately controlling FDR, which are not seen in other apps, as well as an emphasis on individual gene results and interrogation.
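The double log-transformation problem mentioned above can be guarded against with a simple range check before transforming. The following is a minimal Python sketch under stated assumptions, not code taken from XTABLE, DEGenR, MAAP or GEO2R: the function names are hypothetical, the cutoff of 100 on the 99th percentile is an illustrative choice, and the input is assumed to hold non-negative expression values.

```python
import math


def looks_log_transformed(values, max_log_value=100.0):
    """Heuristic check: log2-scale intensities stay small, raw ones reach the thousands.

    The cutoff (max_log_value) is an assumption made for this sketch.
    """
    finite = sorted(v for v in values if not math.isnan(v))
    if not finite:
        raise ValueError("no finite expression values supplied")
    # Use roughly the 99th percentile so one outlier does not decide the outcome.
    idx = min(len(finite) - 1, int(0.99 * (len(finite) - 1)))
    return finite[idx] <= max_log_value


def ensure_log2(values):
    """Return log2(x + 1) for raw-scale data and leave log-scale data untouched."""
    if looks_log_transformed(values):
        return list(values)
    return [math.log2(v + 1.0) for v in values]
```

Calling `ensure_log2` on a vector that is already on the log scale returns it unchanged, so the step can be applied repeatedly without compounding the transformation.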

    A new paragraph in the discussion section (lines 361-370) addresses the potential use of existing applications instead of XTABLE.

    Secondly, XTABLE doesn't provide a solution to integrate the four datasets incorporated in the tool. One can only analyse one dataset at a time with XTABLE. The differences in terms of methodology and study design within these four datasets have been elaborated on in the article. However, attempts to integrate them were lacking.

    We repeatedly considered different strategies of integrating the analysis of the four datasets and we always reached the conclusion that it was hardly going to offer any advantage, or that it might be counterproductive.

    Integration can occur at multiple levels. One possibility is to carry out the same analysis (e.g. expression of a given gene in two groups of samples) in all datasets. Since the design and methodologies of the four studies differ substantially (different stages, different definitions of progression status, etc.), a single stratification for all datasets is not possible. Moreover, interrogating the four datasets simultaneously would slow the analysis without offering any significant advantage. Another possibility is the integration of results in the same output, for instance a single chart with the expression of a given gene in multiple subgroups of the four datasets. Because of the differences in design, we think that the results from each cohort should be kept separate and then compared with a similar analysis from the other datasets. Scientifically, this is the best way to proceed as it avoids confusion.
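The approach of analysing each cohort on its own and comparing the results afterwards can be sketched as follows. This is an illustrative Python example and not XTABLE code: the function names, the cohort labels and the expression values are hypothetical, and the AUC is computed here as the fraction of sample pairs in which the second group has the higher value.

```python
def auc(group_a, group_b):
    """Probability that a value from group_b exceeds one from group_a (ties count 0.5)."""
    if not group_a or not group_b:
        raise ValueError("both groups need at least one sample")
    wins = 0.0
    for b in group_b:
        for a in group_a:
            if b > a:
                wins += 1.0
            elif b == a:
                wins += 0.5
    return wins / (len(group_a) * len(group_b))


def per_cohort_auc(cohorts):
    """Compute the AUC separately in each cohort; no samples are pooled across cohorts."""
    return {name: auc(reg, prog) for name, (reg, prog) in cohorts.items()}


# Made-up values for one gene, keyed by a hypothetical cohort label:
# (regressive samples, progressive samples)
example = {
    "cohort_A": ([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]),
    "cohort_B": ([1.0, 2.0], [1.0, 2.0]),
}
results = per_cohort_auc(example)
```

Keeping one result per cohort leaves each study's own definition of the two groups intact, and the per-cohort values can then be compared side by side.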

    Nevertheless, XTABLE allows the export of data for further analysis. The user can use this option to integrate data using other applications or statistical packages.

    We do understand the attractiveness of integrating the four datasets, and we seriously considered it. But there is a fine balance between user-friendliness, flexibility, and scientific rigour. We think that XTABLE achieves this balance. Increasing integration of datasets might lead to errors and wrong conclusions due to biological and methodological differences between studies. We believe that comparing analyses obtained independently from the four cohorts is the most sensible way to proceed.

    We propose to discuss these aspects accordingly.

    The integrative analysis of two or more datasets has been discussed in a new paragraph (lines 382-391).

    The tool also lacks the flexibility for users to add more datasets. This would be helpful when there are more datasets of PMLs available publicly.

    This was also a recurring topic of discussion while designing XTABLE. Creating a tool that could be used to analyse other cohorts of precancerous lesions while maintaining the ease of use was certainly a challenge. We had to adapt XTABLE to the characteristics of each of the four databases: specific stratification criteria, different nomenclatures for the different sample types, etc. Designing a Shiny app that can be adapted to other present or future datasets without changing the code is simply not practical.

    The flexibility that these other Shiny apps offer for analysing any RNA-seq dataset requires that the contrasts used for the differentially expressed gene analysis be defined manually. IDEP requires an experimental design file in which the sample names match exactly those in the counts file, and pre-processing visualisation is limited to the first 100 samples. RNFuzzyApp is similar, but we could not format the experimental design file in a way that did not crash the app upon upload. TCC-GUI requires all sample names to be renamed to the contrast group plus the replicate number. Apps that allow datasets to be uploaded do not have a practical or easy way to set up the DEG analysis of more than a couple of dozen samples.
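The exact-match requirement between the counts file and the experimental design file is easy to violate with hundreds of samples. As a hedged illustration, the Python sketch below shows the kind of check a user would need to run before uploading files to such an app; the function name and the sample names are hypothetical and do not come from any of the tools discussed.

```python
def check_design(count_samples, design_samples):
    """Report sample names that appear in only one of the two files.

    Matching is exact and case-sensitive, mirroring the strict requirement
    described above.
    """
    count_set, design_set = set(count_samples), set(design_samples)
    return {
        "missing_from_design": sorted(count_set - design_set),
        "missing_from_counts": sorted(design_set - count_set),
    }


# A single lower-case letter is enough to break the match.
report = check_design(["S1", "S2", "S3"], ["S1", "S2", "s3"])
```

An empty list under both keys means the two files refer to exactly the same set of samples.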

    Future versions of XTABLE can be updated to include additional curated PML datasets that would enhance hypothesis generation upon request. Importantly, the code is freely available and can be modified by other scientists to add their cohorts of interest, although we agree that a high level of expertise in coding will be needed. We propose to add these considerations to the text.

    The possibilities of expanding XTABLE to new databases are discussed in lines 392-398.

    Understanding the biology of PML progression would require a multi-omics approach. XTABLE analyses transcriptome data and lacks integration of other omics data. The authors mention the availability of data from whole exome, methylation, etc from the four studies they have selected. However, apart from the CIN scores, they haven't integrated any of the other layers of omics data available.

    Only one dataset (GSE108104) contains whole-exome sequencing and methylation data. We considered that a multi-omics approach in XTABLE would result in an overcomplicated application. As far as early detection and biomarker discovery are concerned, transcriptomic data is the most interesting parameter.

    Also discussed in lines 382-391.

    Lastly, the authors could have elaborated on the limitations of the tool and their analysis in the discussion.

    We propose to raise these limitations accordingly in the discussion.

    See above.

    Reviewer #2 (Public Review):

    In this manuscript, Roberts et al. present XTABLE, a tool to integrate, visualise and extract new insights from published datasets in the field of preinvasive lung cancer lesions. This approach is critical and to be highly commended; whilst the Cancer Genome Atlas provided many insights into cancer biology it was the development of accessible visualisation tools such as cbioportal that democratised this knowledge and allowed researchers around the world to interrogate their genes and pathways of interest. XTABLE is trying to do this in the preinvasive space and should certainly be commended as such. We are also very impressed by the transparency of the approach; it is quite simple to download and run XTABLE from their Gitlab account, in which all data acquisition and analysis code can be easily interrogated.

    We would however strongly advocate deploying XTABLE to a web-accessible server so that researchers without experience in R and git can utilise it. We found it a little buggy running locally and cannot be sure whether this is due to our setup or the code itself. Some issues clearly need development; Progeny analysis brings up a warning "Not working for GSE109743 on the server and not sure why". GSEA analysis does not seem to work at all, raising an error "Length information for genome hg38 and gene ID ensGene is not available". In such relatively complex software, some such errors can be overlooked, as long as the authors have a clear process for responding to them, for example using GitLab issue reporting. Some acknowledgement that this is an ongoing development would be helpful.

    We thank the reviewer for these comments. We will inspect the code to address those warnings, implement a system for issue reporting, and add the acknowledgements suggested by the reviewer. Deploying XTABLE to a web-accessible server could present a challenge in the long term, because computing resources would need to be allocated for years, with the associated economic cost.

    The code has been inspected to remove the warnings and errors pointed out by the reviewer.

    The authors discuss some very important differences between the datasets in the text. Most notably they differ in endpoints and in the presence of laser capture. We would advocate including some warning text within the XTABLE application to explain these. For example, the "persistent/progressive" endpoint used in Beane et al (next biopsy is the same or higher grade) is not the same as the "progressive" endpoint in Teixeira et al (next biopsy is cancer); samples defined as "persistent/progressive" may never progress to cancer. This may not be immediately obvious to a user of XTABLE who wishes to compare progressive and regressive lesions. Similarly, the use of laser capture is important; the authors state that not using laser capture has the advantage of capturing microenvironment signals, but differentiating between intra-lesional and stromal signals is important, as shown in the Mascaux and Pennycuick papers. The authors cannot do much about the different study designs, but as the goal is to make these data more accessible we think some brief description of these issues within the app would help to prevent non-expert users from drawing incorrect conclusions.

    The authors themselves illustrate this clearly in their analysis of CIN signatures in progression potential. They observe that there is a much clearer progressive/regressive signal in GSE108124 compared to GSE114489 and GSE109743. This does not seem at all surprising, since the first study used a much stricter definition of progression - these samples are all about to become cancer whereas "progressive" samples in GSE109743 may never become cancer - and are much enriched for CIN signals due to laser capture. Their discussion states "CIN scores as a predictor of progression might be limited to microdissected samples and CIS lesions"; you cannot really claim this when "progression" in the two cohorts has such a different meaning. To their credit, the authors do explain these issues but they really should be clearly spelled out within the app.

    This is a very good point. We will add warning text about the differences between studies in the definition of progression potential and in sample processing (LCM or not) so that the user is permanently aware of the differences between cohorts.

    A new tab (Dataset) has been added containing a table with the methodologies used in each study and the differences in progression status definitions. Additionally, we emphasized these differences in the main text of the manuscript (lines 296-300 and 403-409).
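One way such a caveat could be surfaced whenever results from two studies are placed side by side is sketched below in Python. This is a hypothetical illustration and not the implementation of the Dataset tab: the dictionary and function names are assumptions, and the two endpoint definitions are copied from the reviewer's comment above.

```python
# Endpoint definitions as quoted in the review; the dictionary layout is an assumption.
ENDPOINTS = {
    "Beane et al": "next biopsy is the same or higher grade",
    "Teixeira et al": "next biopsy is cancer",
}


def progression_warning(study_a, study_b, endpoints=ENDPOINTS):
    """Return a warning string when two studies define progression differently."""
    def_a, def_b = endpoints[study_a], endpoints[study_b]
    if def_a == def_b:
        return None
    return (
        f"Progression is defined differently: {study_a} ({def_a}) "
        f"vs {study_b} ({def_b}). Compare results with caution."
    )


message = progression_warning("Beane et al", "Teixeira et al")
```

Storing the definitions next to the data means the warning is generated from the same source as the descriptive table, so the two cannot drift apart.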

    We are not sure we agree with their analysis of CDK4/Cyclin-D1 and E2F expression in early lesions. The authors claim these are inhibited by CDKN2A and therefore are markers of CDKN2A loss of function. But these genes are markers of proliferation and can be driven by a range of proliferative processes. Histologically, low-grade metaplasias and dysplasias all represent proliferative epithelium when compared to normal control, but most never become cancer. It is too much of a leap to say that these are influenced by CDKN2A because that gene is inactivated in LUSC; do the authors have any evidence that this gene is altered at the genomic level in low-grade lesions?

    We are grateful for this comment. There is currently no evidence that CDKN2A mutations occur in low-grade lesions and therefore we cannot argue that the CDK4/Cyclin-D1 and E2F expression signatures are the result of CDKN2A inactivation in low-grade lesions. We propose to modify the text to introduce these caveats to our conclusion and make our interpretations more accurate.

    We have modified the discussion (lines 443-454) to address the interpretation of our results regarding the connection between CDKN2A inactivation and the CDK4/Cyclin-D1 and E2F signatures. We now focus our conclusions on the pathway itself and mention Cyclin-D1 and CDKN2A alterations as potential modulators of the changes in the pathway, while leaving the discussion open to other drivers.

    Overall this tool is an important step forwards in the field. Whilst we are a little unconvinced by some of their biological interpretations, and the tool itself has a few bugs, this effort to make complex data more accessible will be greatly enabling for researchers and so should be commended. In the future, we would like to see additional molecular data integrated into this app, for example, the whole genome and methylation data mentioned in line 153. However, we think this is an excellent start to combining these datasets.