TF-Prioritizer: a Java pipeline to prioritize condition-specific transcription factors

Abstract

Background

Eukaryotic gene expression is controlled by cis-regulatory elements (CREs), including promoters and enhancers, which are bound by transcription factors (TFs). Differential expression of TFs and their binding affinity at putative CREs determine tissue- and developmental-specific transcriptional activity. Consolidating genomic data sets can offer further insights into the accessibility of CREs, TF activity, and, thus, gene regulation. However, the integration and analysis of multi-modal data sets are hampered by considerable technical challenges. While methods for highlighting differential TF activity from combined chromatin state data (e.g., ChIP-seq, ATAC-seq, or DNase-seq) and RNA-seq data exist, they do not offer convenient usability, have limited support for large-scale data processing, and provide only minimal functionality for visually interpreting results.

Results

We developed TF-Prioritizer, an automated pipeline that prioritizes condition-specific TFs from multi-modal data and generates an interactive web report. We demonstrated its potential by identifying known TFs along with their target genes, as well as previously unreported TFs active in lactating mouse mammary glands. Additionally, we studied a variety of ENCODE data sets for cell lines K562 and MCF-7, including twelve histone modification ChIP-seq as well as ATAC-seq and DNase-seq datasets, where we observe and discuss assay-specific differences.

Conclusion

TF-Prioritizer accepts ATAC-seq, DNase-seq, or ChIP-seq and RNA-seq data as input and identifies TFs with differential activity, thus offering an understanding of genome-wide gene regulation, potential pathogenesis, and therapeutic targets in biomedical research.

Article activity feed

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad026), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer Kaixuan Luo

    This paper develops a novel pipeline, TF-Prioritizer, to prioritize condition-specific TFs through integrative analysis of histone modification (HM) ChIP-seq and RNA-seq data. The pipeline integrates multiple computational tools: it calculates TF binding site affinities and links candidate binding sites to genes using TRAP and TEPIC. It uses DYNAMITE, a sparse logistic regression classifier, to infer TFs related to differential gene expression between conditions. It computes an aggregated score, the "TF-TG score", to score TFs from multiple types of evidence, and obtains a prioritized list of TFs from all histone modifications using a discounted cumulative gain ranking approach. It also provides additional functionality and a web interface to visualize the results.
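To make the rank aggregation step in the summary above concrete, here is a minimal sketch of a discounted-cumulative-gain style aggregation across per-dataset TF rankings. It is an illustration under stated assumptions, not the pipeline's actual formula: the `1 / log2(rank + 1)` discount is the textbook DCG discount, and the function name `aggregate_rankings`, the dataset labels, and the TF names are hypothetical.

```python
import math
from collections import defaultdict


def aggregate_rankings(rankings):
    """Combine per-dataset TF rankings into one list (best first).

    rankings: dict mapping a dataset label to a list of TFs, best first.
    Assumption: each appearance contributes 1 / log2(rank + 1), the
    textbook DCG discount; the published pipeline may use another gain.
    """
    scores = defaultdict(float)
    for ranked_tfs in rankings.values():
        for rank, tf in enumerate(ranked_tfs, start=1):
            scores[tf] += 1.0 / math.log2(rank + 1)
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Toy input (hypothetical labels, not results from the paper).
rankings = {
    "HM1": ["TF_A", "TF_B", "TF_C"],
    "HM2": ["TF_A", "TF_B"],
    "HM3": ["TF_D"],
}
aggregated = aggregate_rankings(rankings)
for tf, score in aggregated:
    print(f"{tf}\t{score:.3f}")
```

In this toy input, `TF_D` is ranked first in its only dataset yet ends up below `TF_B`, which is ranked second in two datasets. Under this simple scheme, a TF supported by a single histone modification can be outranked by TFs with weaker but broader support.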

    Overall, the pipeline could be very useful for biologists with a user-friendly web application to automate the entire process from data preprocessing to statistical analysis and obtain interactive reports to gain novel biological insights. However, more systematic evaluations are needed to demonstrate the benefits of this pipeline.

    Major comments:

    1. In the computation of the aggregated "TF-TG score", the pipeline uses a multiplicative function to combine differential expression (absolute log2FC), TF-gene scores computed by TEPIC, and the total coefficients computed by DYNAMITE. One concern about this approach is that it may miss some TFs with support from only one or two types of evidence. In Fig 5, we see that diffTF identifies a lot more TFs than TF-Prioritizer. I don't think we can conclude that diffTF is less specific than TF-Prioritizer simply based on the number of TFs prioritized. Could some of the TFs identified only by diffTF be important but missed by TF-Prioritizer? I would like to see a more detailed analysis comparing the lists of TFs identified by diffTF and TF-Prioritizer. Other evidence or metrics in addition to the number of prioritized TFs would be helpful to evaluate the plausibility of the prioritized lists of TFs.

    2. It is hard to interpret and evaluate the contribution of each type of evidence to the prioritized TFs. Figure 6b is helpful, but it is unclear how users would be able to evaluate the contribution of the components. Does the software run each combination separately and output a list of prioritized TFs for each combination?

    3. The TEPIC2 paper has already developed a very comprehensive pipeline, including TF affinity calculation by TRAP and computation of TF-gene scores by TEPIC, as well as logistic regression to identify TFs between conditions by DYNAMITE, and it is already well parallelized. The authors should clearly list the novel contributions of this work. It would be helpful to have a table comparing the functionalities and technical features of TF-Prioritizer and TEPIC2.

    4. The software takes histone modification ChIP-seq and RNA-seq data as input. It would significantly broaden the use of the software if it supported DNase-seq and/or ATAC-seq, which are widely used. If this software can take ATAC-seq or DNase-seq data as input, it is important to include those data types and provide some examples to illustrate the usage and performance.

    5. The software combines multiple histone modification ChIP-seq datasets using a discounted cumulative gain ranking approach. However, different types of histone modifications have different epigenomic functions and different combinations indicate different chromatin states. Some TFs may be only enriched in a small subset of histone modifications (already discussed by the authors) and may be missed by the simple discounted cumulative gain ranking approach. The authors should provide prioritized TFs from each histone modification ChIP-seq dataset, and evaluate which TFs were prioritized by all the combined datasets, and which TFs by only one dataset. Also, some ChIP-seq datasets may be of poor quality. Does the software provide other options to rank the TFs from different epigenomic datasets? e.g. set different weights for different epigenomic datasets, etc.

    6. The authors conducted a co-occurrence analysis based on the overlap of peaks. It is unclear whether the method calculates a statistical measure (e.g. a p-value) for the significance of co-occurrence. Also, since the TRAP model generates a quantitative measure of TF binding affinity, I am curious to see whether the quantitative TF binding affinities are also correlated for those co-occurring binding sites.
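The concern raised in major comment 1 can be illustrated with a small sketch of a purely multiplicative score. This is only a reading of the description in that comment (absolute log2FC times the TEPIC TF-gene score times the DYNAMITE coefficient); the function name `tf_tg_score`, the absence of any normalization, and the input values are assumptions made for illustration.

```python
def tf_tg_score(log2_fold_change, tepic_gene_score, dynamite_coefficient):
    """Multiplicative combination of three evidence types.

    Assumption: a plain product with no normalization or weighting;
    the published pipeline may scale or transform the components.
    """
    return abs(log2_fold_change) * tepic_gene_score * dynamite_coefficient


# Hypothetical TF-gene pairs: (log2FC, TEPIC score, DYNAMITE coefficient).
candidates = {
    "TF_X": (2.0, 0.8, 0.5),   # moderate support from all three sources
    "TF_Y": (-3.0, 0.9, 0.0),  # strong support from two sources, none from the third
}
scores = {name: tf_tg_score(*evidence) for name, evidence in candidates.items()}
for name, score in scores.items():
    print(f"{name}\t{score:.2f}")
```

Because the components are multiplied, `TF_Y` scores zero despite strong differential expression and a high TEPIC score, which is the situation the comment describes for TFs supported by only one or two types of evidence.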

    Minor comments:

    1. In Figure 1, it would be helpful to highlight which steps were already implemented in existing tools (and label the tools used), and which steps are novel in this study.
    2. H3K4me3 data seems to be missing in the L10 time point. How does the method handle missing data?
    3. It is unclear how the Pol2 ChIP-seq data was used in this study. Was it included in the model or only in the downstream analysis?
    4. It is hard to interpret the browser tracks of the TF predictions ("Predicted xxx") in Figures 3 and 4. Please add more details about those tracks.
    5. For Figure 6, the authors should provide more details to help readers understand this figure, especially panel b. The figure legend is too short.

    Reviewer: Roza Berhanu Lemma

    In this manuscript, Hoffmann and Trummer et al. report a new automated pipeline that utilizes existing methods, namely (1) DESeq2, to perform differential gene expression analysis between sample groups; (2) TEPIC, a method that links CREs to genes using the biophysical model TRAP; and (3) DYNAMITE, which provides an aggregate score for TF-target genes that determines the contribution of TFs to condition-specific changes between sample groups. Finally, the pipeline uses a Mann-Whitney U test to prioritize TFs by comparing a background distribution with a TF-specific distribution, which allows the identification of TFs with roles in condition-specific gene regulation. Their pipeline allows large-scale processing of data and returns a feature-rich and user-friendly interactive report.
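The final prioritization step described above compares a TF-specific score distribution with a background distribution. The following is a minimal, standard-library sketch of a one-sided Mann-Whitney U test using the normal approximation, without tie or continuity correction. The direction of the test, the function names, and the toy scores are assumptions; in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be preferred.

```python
import math


def midranks(values):
    """Return 1-based ranks, assigning tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = average_rank
        start = end + 1
    return ranks


def mann_whitney_greater(foreground, background):
    """Return (U, p) for the alternative 'foreground tends to be larger'.

    Assumption: normal approximation without tie or continuity
    correction, which is only rough for small samples.
    """
    n_fg, n_bg = len(foreground), len(background)
    ranks = midranks(list(foreground) + list(background))
    u_statistic = sum(ranks[:n_fg]) - n_fg * (n_fg + 1) / 2
    mean_u = n_fg * n_bg / 2
    sd_u = math.sqrt(n_fg * n_bg * (n_fg + n_bg + 1) / 12)
    z = (u_statistic - mean_u) / sd_u
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail normal probability
    return u_statistic, p_value


# Hypothetical scores: one TF's target-gene scores versus all other scores.
tf_scores = [0.70, 0.80, 0.90, 0.95]
background_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]
u_statistic, p_value = mann_whitney_greater(tf_scores, background_scores)
print(u_statistic, round(p_value, 4))
```

A TF whose scores sit consistently above the background yields a large U and a small p-value, so it would be kept; a TF whose scores are indistinguishable from the background yields U near `n_fg * n_bg / 2` and a p-value near 0.5.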

    The authors demonstrated how to use TF-Prioritizer using public datasets from a mouse mammary gland development study and performed independent validation using datasets from ChIP-Atlas. They were able to capture both known TFs with previously reported roles in mammary gland development/lactation and new TFs that may have a role in these processes. The work is well thought out and executed, but to raise its quality further, the authors should address the following points.

    Major:

    1. Although their validation nicely portrays the potential application of their pipeline in answering biological questions, I would like assurance that this is not an isolated case. Therefore, the authors should test their pipeline on another example dataset to convince their readers. One suggestion is to run TF-Prioritizer on one of the deeply profiled cell lines (e.g., K562, MCF-7, etc.) to investigate TF prioritization, e.g., during differentiation (change of cell fate), and see whether lineage-determining TFs are prioritized in such cases. This may highlight the versatility and robustness of TF-Prioritizer. This is also important as your readers are not (certainly not all of them) from the mammary gland development field. As such, dedicating a large portion of your discussion to this process is too much. Highlighting the versatility of your pipeline by capturing more than one specific developmental process would do the paper a great favor by showing the different ways TF-Prioritizer can be used, which in turn may attract more users to your pipeline.

    2. I have an issue with how the 'Results and Discussion' section is organized. The authors dedicated a separate subtopic to each TF they prioritized and reviewed the literature on its role in mammary gland development and lactation. My recommendation is to instead have one subtopic and discuss these TFs paragraph by paragraph in a concise manner. A more concrete way to reorganize this would be to separate these into two subtopics: (1) known TFs with a role in mammary gland development/lactation and (2) novel TFs with a predicted role in mammary gland development/lactation. To make this reorganization easier/smoother, cut down the details of what you observe in the figures (e.g. p16, lines 22-27 and p17, lines 1-3), discuss the main message, and put the detailed text about the figures in the figure captions.

    3. All figures and tables should have more information in the caption, including those in the 'Supplementary Material'.

    Minor:

    1. p7 line 9: how often does one find these combinations of data types (modalities) in the different conditions, cell types or models being studied? Could some of the HMs be replaced with other data modalities, e.g. ATAC-seq, DHS data or data from other chromosome profiling methods? Could the pipeline be adapted to incorporate CUT&Tag/CUT&RUN, or is it specific to ChIP-seq data? The authors should discuss whether this is possible or not.

    2. p13 line 3: the authors discuss that "ChIP-Atlas provides more than 362,121 datasets for six model organisms…". Could TF-Prioritizer be easily adapted to other databases/resources that ChIP-Atlas does not cover (e.g. for other organisms) and that the community might be interested in?

    3. p14 line 2: "... expressed gene for this analysis but focus on affinities only". Why this is the case is not argued/discussed. It would be helpful if this and other parameter choices were discussed under a separate subtopic to inform future readers/users of TF-Prioritizer.

    4. Figures should be cited in chronological order. Adjust the text or reorder the figures.

    5. When the authors discuss the evaluation of the prioritized TFs in separate sections, they often start with "In Figure Xa) …" and "Figure Yc) shows that …", etc. Such text fits best in figure captions rather than in the 'Results and Discussion'.

    6. p21 line 16: "We predicted that several Rho GTPase-associated genes are regulated by the predicted TFs". This sentence sounds a bit circular; you may rephrase it as follows: 'We propose that our predicted TFs regulate several Rho GTPase-associated genes'.

    7. Figures 3 and 4 have the same general message/purpose and look redundant. This is reflected in the phrases '...(black arrows) as they are already known to be crucial in either mammary gland development or lactation.' and 'In the heatmaps, we can observe a clear separation of these target genes between the time points X and Y…'. I suggest the authors choose one of them as a main figure and place the other in the Supplementary Material.

    8. In the captions of Figs. 3 and 4, the authors should indicate what the black boxes represent. One can guess what they are from the main text, but the captions would profit from a more detailed explanation. You should at least describe some of the things that need to be highlighted in the figures to guide your readers.

    Reviewer Xiaowo Wang

    Markus et al. developed a new pipeline, TF-Prioritizer, to discover potential cell- or tissue-specific transcription factors (TFs) from histone modification ChIP-seq data and RNA-seq data. TF-Prioritizer is mainly based on the framework of the state-of-the-art method TEPIC, which models how TFs regulate genes. The authors extend TEPIC by integrating more information, such as differential gene expression computed with DESeq2, and by linking TF binding in cis-regulatory elements to gene expression using DYNAMITE. They also designed a new statistical method to rank the TFs across different cell types or across time-series samples. The authors also provide some cases to validate the pipeline. The pipeline is useful in biomedical research. The manuscript is well written and provides enough details. Addressing or further considering the following issues may benefit readers.

    1. TF-Prioritizer requires histone modification (HM) ChIP-seq data as input. It may support different types of HMs. Users may want to know how to choose a proper set of HMs. The authors should evaluate some cases to show TF-Prioritizer's performance with different HMs as input.

    2. ATAC-seq is more widespread across different kinds of cells and tissues. It seems TF-Prioritizer could also be applied to ATAC-seq peaks. Why does TF-Prioritizer not support ATAC-seq now?

    3. On page 11, there may be some mistakes in the definition of BG(m) and FG(t,m). Should t \in TF(m) in BG(m) be moved to FG(t,m)?

    4. The software is hard to install without a sudo/root account. It would be better to provide a Docker image that is ready for users to run the software.