Defining the Characteristics of Type I Interferon Stimulated Genes: Insight from Expression Data and Machine Learning

Abstract

A virus-infected cell triggers a signalling cascade resulting in the secretion of interferons (IFNs), which in turn induce the up-regulation of IFN-stimulated genes (ISGs) that play an important role in the inhibition of the viral infection and the return to cellular homeostasis. Here, we conduct detailed analyses on 7443 features relating to evolutionary conservation, nucleotide composition, gene expression, amino acid composition, and network properties to elucidate factors associated with the stimulation of genes in response to type I IFNs. Our results show that ISGs are less evolutionarily conserved than genes that are not significantly stimulated in IFN experiments (non-ISGs). ISGs show significant depletion of GC-content in the coding region of their canonical transcripts, which leads to under-representation in the nucleotide compositions. Differences between ISGs and non-ISGs are also reflected in the properties of their coded amino acid sequence compositions. Network analyses show that ISG products tend to be involved in key paths but are away from hubs or bottlenecks of the human protein-protein interaction (PPI) network. Our analyses also show that interferon-repressed human genes (IRGs), which are down-regulated in the presence of IFNs, can have similar properties to ISGs, thus leading to false positives in ISG predictions. Based on these analyses, we design a machine learning framework integrating the usage of support vector machine (SVM) and feature selection algorithms. The ISG prediction achieves an area under the receiver operating characteristic curve (AUC) of 0.7455 and demonstrates the similarity between ISGs triggered by type I and III IFNs. Our machine learning model predicts a number of genes as potential ISGs that so far have shown no significant differential expression when stimulated with IFN in the cell types and tissue types compiled in the available IFN-related databases.
A webserver implementing our method is accessible at http://isgpre.cvr.gla.ac.uk/.

Author summary

Interferons (IFNs) are signalling proteins secreted from host cells. IFN-triggered signalling activates the host immune system in response to intra-cellular infection. It results in the stimulation of many genes that have anti-pathogen roles in host defenses. Interferon-stimulated genes (ISGs) have unique properties that make them different from those not significantly up-regulated in response to IFNs (non-ISGs). We find the down-regulated interferon-repressed genes (IRGs) have some shared properties with ISGs. This increases the difficulty of distinguishing ISGs from non-ISGs. The use of machine learning is a sensible strategy to provide high throughput classifications of putative ISGs, for investigation with in vivo or in vitro experiments. Machine learning can also be applied to human genes for which there are insufficient expression levels before and after IFN treatment in various experiments. Additionally, the interferon type has some impact on ISG predictability. We expect that our study will provide new insight into better understanding the inherent characteristics of human genes that are related to response in the presence of IFNs.


Reviewer 1: Milton Pividori

In this manuscript, the authors analyzed different characteristics that are potentially related to the expression of human genes under IFN-a stimulation. A classification model is built to predict ISGs (genes that are upregulated following IFN-a stimulation) in human fibroblast cells. The model also performs feature selection, and the authors used different test sets (on different types of IFN) to validate their model. The authors provide a web server that implements this machine learning model. I liked the introduction; the background and motivation were clear. However, the Results section was a bit hard to follow, in particular the implementation of the machine learning models, with different classifiers applied inconsistently across distinct feature sets. At the beginning of this section, the authors perform extensive manual feature analyses across different feature types (related to alternative splicing, duplication, and mutation) to build a refined dataset. These analyses basically correlate each individual feature with the expression of genes in the presence of IFN-a. I have several concerns here, related mainly to the correlation between features, that I describe below.

General comments:

Regarding reproducibility, the authors provide a GitHub repository with source code, the trained model, and data. From the documentation and notes in the manuscript (lines 1015-1023), it looks like this can only be run on macOS, which makes it very hard for me to test (I'm a Linux user). I recommend that the authors read and follow the article "Reproducibility standards for machine learning in the life sciences" (https://doi.org/10.1038/s41592-021-01256-7). Having, for instance, a Docker image to download and run the analyses would be fantastic.
The authors perform a comprehensive analysis of features that differentiate different gene classes. I wonder why they did not first use a machine learning model to automatically find these important features and then analyze which features were selected (instead of the other way around, as done in the study). I think there is perhaps too much manual feature engineering in the steps that precede training an ML model.
Related to the previous point, one of my concerns in the comments below is feature correlation. The authors compare individual features regarding their ability to separate different gene classes (ISG vs background vs non-ISG). But one can imagine that some features are highly correlated. Some features might not be useful for separating gene classes in a single-feature analysis (as the authors do at the beginning), but they could be useful in combination with other features. Unless I'm missing an important point, I would let the machine learning model learn this and then analyze each feature individually after the model identifies them.
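To make the correlation concern concrete, the following is a minimal sketch, not taken from the manuscript or its repository, of how highly correlated feature pairs could be flagged before any single-feature comparison. The feature names and values are invented for illustration, and the 0.9 cutoff is an arbitrary assumption.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # constant column: correlation is undefined, treat as 0
    return cov / (sx * sy)


def correlated_pairs(features, threshold=0.9):
    """Return (name_a, name_b, r) for every feature pair with |r| >= threshold."""
    pairs = []
    for a, b in combinations(sorted(features), 2):
        r = pearson(features[a], features[b])
        if abs(r) >= threshold:
            pairs.append((a, b, r))
    return pairs


# Toy values for five hypothetical genes (illustrative numbers only).
features = {
    "gc_content": [0.40, 0.45, 0.50, 0.55, 0.60],
    "at_content": [0.60, 0.55, 0.50, 0.45, 0.40],  # complement of gc_content
    "n_isoforms": [3, 1, 4, 1, 5],
}
flagged = correlated_pairs(features)
```

In this toy table only the `at_content`/`gc_content` pair is flagged, because one column is the complement of the other; a filter of this kind says nothing about features that are informative only in combination, which is the second half of the concern.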
The authors are concerned that including too many features in the support vector machine (SVM) model would complicate the prediction task. To remedy this, they manually select the features according to what is, in my opinion, a more subjective criterion. Why didn't the authors use a feature selection algorithm here? I know that they propose a model including feature selection, but I do not fully understand the purpose of all the preceding manual feature analyses. Using a known feature selection method here would provide a more data-driven approach to improve classification, in addition to their manual expert curation (which is also valid).
They run several classification models, but not consistently across the same set of features. For example, only the SVM is run across the genetic, parametric, and all-features sets, etc., but not the other models. Why is that?
The manuscript would really benefit from a figure with the main steps of the analyses performed, models tested, datasets employed, etc. It's hard to get the big picture as it is now.

Results / Evolutionary characteristics of ISGs

Paragraph between lines 131-148:
I think the window size used (mentioned in the text) should be added to the Figure 2 caption.
What is the vertical dashed line? In the text, you say that the genes to the left of this line are IRGs, but I don't understand the meaning of that vertical line (-0.9 log fold change). This explanation, which I didn't see, should also be added to the figure caption.
From the text, I understand that the subfigures in Figure 2 contain IRGs, non-ISGs and ISGs. Would it be possible, or meaningful for the reader, to add an extra vertical line to separate them?

Results / Differences in the coding region of the canonical transcripts

Paragraph between lines 193-208:
If GC-content is more underrepresented in ISGs than in non-ISGs, then ApT and TpA should be expected to be more enriched in ISGs, right? This sounds like a redundant analysis. I would expect these two sequence-derived features to be correlated. If this is the case, maybe it would be better to highlight other features instead of a correlated/expected one?
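The expected coupling between GC-content and ApT/TpA frequency can be checked directly on sequences. The sketch below is only an illustration under stated assumptions: the two short sequences are made up, the helper names are hypothetical, and dinucleotides are counted with overlapping windows, which may differ from how the manuscript defines its features.

```python
from collections import Counter


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return sum(base in "GC" for base in seq) / len(seq)


def dinucleotide_freq(seq):
    """Relative frequency of each overlapping dinucleotide in the sequence."""
    seq = seq.upper()
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}


def apt_tpa_fraction(seq):
    """Combined frequency of the ApT and TpA dinucleotides."""
    freqs = dinucleotide_freq(seq)
    return freqs.get("AT", 0.0) + freqs.get("TA", 0.0)


# Two invented sequences: one AT-rich, one GC-rich.
at_rich = "ATATTAATGCATTA"
gc_rich = "GCGCCGATGCGGCC"

for name, seq in (("at_rich", at_rich), ("gc_rich", gc_rich)):
    print(name, round(gc_content(seq), 3), round(apt_tpa_fraction(seq), 3))
```

On real transcripts, computing both quantities per gene and correlating them across genes would show how much of the ApT/TpA signal is already explained by GC depletion.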
Figure 4: here the authors divided the parametric set of features into four categories and compared their representations among ISGs, non-ISGs and background genes. The figure shows p-values of the tests on the y-axis, and the four categories of features on the x-axis. I think it's important to run a negative control: could you please run these tests again, say, 100 times, with gene IDs/names shuffled, and check whether some of these results also appear in these null simulations? Maybe you can keep the same figure, but remove those also found in the null simulations.

Paragraph between lines 209-227:
Is it possible that the comparison of codon frequencies (third category of features) is correlated with previous findings (like GC-content or ApT/TpA enrichment)? If so, could this analysis also be expected or redundant? For example, in ISGs there is an underrepresentation of GC-content, and you also found that in ISGs there is an underrepresentation of "CAG" codons. I might be missing something, but aren't these expected to be correlated?

Results / Differences in the protein sequence

Paragraph between lines 302-323:
Figure 6: I would suggest adding the same negative control suggested before.

Results / Differences in network profiles
I think it's important to define all eight features used in the network analyses (closeness, betweenness, etc.); otherwise it's hard to follow what comes next.

Results / Features highly associated with the level of IFN stimulations
Figures 9 and 10: it would be good to add the sign of the correlation in the figure, in addition to mentioning it in the caption (as it is now).

Results / Difference in feature representation of interferon-repressed genes and genes with low levels of expression
Given the unique patterns or differences between non-ISG class and IRG class, wouldn't it be better to perform different analyses excluding IRG genes? The authors also acknowledge these risks in lines 539-

Results / Implementation with machine learning framework

It was hard for me to understand the workflow in this section: for example, you used different machine learning models applied to distinct feature sets. Why don't you apply the same set of models to the same set of features? I think this section needs an initial paragraph with a global description of what you are trying to do.
For example, I don't think I understand very well the concept of "disruptive feature". What does it mean?
Table 3: I don't understand the threshold selection here. I guess you refer to the classification or decision threshold of a model that outputs the probability of a gene being an ISG or a non-ISG. First, I think there should be a line separating the performance measures to clearly show which are "Threshold-dependent" and which are "Threshold-independent".
I also understand that, during cross-validation, you selected, for each model/feature set combination, the threshold that maximized the MCC (this is explained in a footnote to Table 3, but it should be mentioned more explicitly in the text).
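For readers unfamiliar with this step, here is a minimal sketch of picking the decision threshold that maximizes the Matthews correlation coefficient (MCC). It is not the authors' implementation: the scores and labels are invented, the function names are hypothetical, candidate thresholds are simply the observed scores, and ties are resolved in favour of the lowest threshold.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_threshold(scores, labels):
    """Scan observed scores as thresholds; return (threshold, mcc) with the highest MCC."""
    best = (None, -1.0)
    for t in sorted(set(scores)):
        # A gene is predicted positive (ISG) when its score is >= t.
        tp = sum(s >= t and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= t and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < t and y == 1 for s, y in zip(scores, labels))
        tn = sum(s < t and y == 0 for s, y in zip(scores, labels))
        value = mcc(tp, tn, fp, fn)
        if value > best[1]:  # strict '>' keeps the lowest threshold on ties
            best = (t, value)
    return best


# Invented classifier scores and true labels (1 = ISG, 0 = non-ISG).
scores = [0.10, 0.35, 0.40, 0.55, 0.70, 0.90]
labels = [0, 0, 1, 0, 1, 1]
threshold, value = best_threshold(scores, labels)
```

With these toy numbers two thresholds (0.40 and 0.70) reach the same MCC, which illustrates why the tie-breaking rule and the data used for the scan (training folds rather than the test set) are worth stating in the text.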
Table 3: What is the "Optimum" set of features? Why is this "Optimum" set only used with SVM?
How does the "AUC-driven subtractive iteration algorithm (ASI)" compare with other feature selection algorithms?
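As a point of reference for such a comparison, the sketch below shows a generic AUC-driven backward elimination. It is an assumption-laden illustration and not the manuscript's ASI algorithm, whose details are not given in this review: the scorer (a plain sum of the selected feature values) is a hypothetical stand-in for a trained classifier, the data are invented, and the AUC is computed on the same toy data rather than under cross-validation.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def score_genes(rows, selected):
    """Stand-in scorer (assumption): sum of the selected feature values per gene."""
    return [sum(row[f] for f in selected) for row in rows]


def backward_elimination(rows, labels, features):
    """Repeatedly drop a feature whose removal increases the AUC; stop when none does."""
    selected = list(features)
    best = auc(score_genes(rows, selected), labels)
    improved = True
    while improved and len(selected) > 1:
        improved = False
        for f in list(selected):
            trial = [g for g in selected if g != f]
            trial_auc = auc(score_genes(rows, trial), labels)
            if trial_auc > best:
                selected, best, improved = trial, trial_auc, True
                break  # restart the scan from the reduced feature set
    return selected, best


# Invented data: "signal" separates the classes, "noise" works against it.
rows = [
    {"signal": 0.9, "noise": 0.1},
    {"signal": 0.8, "noise": 0.0},
    {"signal": 0.2, "noise": 0.9},
    {"signal": 0.1, "noise": 0.3},
]
labels = [1, 1, 0, 0]
selected, best_auc = backward_elimination(rows, labels, ["signal", "noise"])
```

Here the feature whose removal raises the AUC plays the role that a "disruptive feature" might play in the manuscript, although that reading is itself an assumption. Comparing ASI against a baseline of this kind, or against established methods such as recursive feature elimination, would help readers judge what the new algorithm adds.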
Table 5: you mention this in the text, but it would be good to have an extra column indicating which datasets were used for training and which are for testing.
Figure 13: it would be good to have the AUROC in the figure, not only the curves.

Web-server:
I think, in general, that the web application needs to be more intuitive and have more documentation. For example, the main interface says "Predict your human genes of interest". What does that mean? What does it predict?

Reviewer 2: Muthukumaran Venkatachalapathy

First of all, this manuscript is well written and reflects a thorough research investigation. I enjoyed reading about interferons, interferon-stimulated genes (ISGs), mechanisms and signalling pathways. In the introduction, the authors have highlighted the different methods (including other bioinformatics databases) available to identify ISGs and their potential pitfalls. This unmet need is addressed using in silico approaches, which were used to distinguish interferon-stimulated genes from non-stimulated ones in human fibroblast cells. Here, the authors have applied a combination of expression data and sequential/compositional features and designed a machine learning model for distinguishing ISGs from non-ISGs. Apart from features like duplication, alternative splicing, mutation and presence of multiple ORFs, the authors extracted various sequential features and found them to be well correlated with ISG prediction. For example, ISGs are prone to GC depletion, and a significant difference in codon usage among ISGs was found. In that context, the authors claim that ISGs are evolutionarily less conserved and that codon usage features, genetic composition features, proteomic composition features and sequence patterns (especially SLNPs and SLAAPs) are optimal parameters that can cumulatively help in differentiating ISGs from non-ISGs. When it comes to building a machine learning model, the authors faced challenges due to similarities between ISGs and IRGs. They experimented with different algorithms for model building, including decision trees and random forests, and found decent results with a support vector machine.
Limitation: Model prediction accuracy was close to 70% for type I and III IFN, and the model performed below par when predicting ISGs activated by the type II IFN system. There is scope to improve the model prediction accuracy and extend its usage to type II IFN systems. If the authors could briefly add a few points on how to improve the model accuracy and also highlight the application/impact of this work in their discussion, that would help scientists from other backgrounds to engage with this manuscript.
Relevance: I believe there are inherent attributes (genetic, compositional, expression) of ISGs which may facilitate or even elevate their expression after IFN stimulation. On the other hand, I think these properties may also be leveraged by viruses to escape the IFN-mediated antiviral response or to evolve away from it. This study is relevant during the ongoing pandemic; this bioinformatics tool can help design better drug targets and may indirectly aid in developing novel antiviral compounds. I recommend this work for publication without any changes.
