A practical tool for maximal information coefficient analysis
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
Background
The ability of finding complex associations in large omics datasets, assessing their significance, and prioritizing them according to their strength can be of great help in the data exploration phase. Mutual information-based measures of association are particularly promising, in particular after the recent introduction of the TICe and MICe estimators, which combine computational efficiency with superior bias/variance properties. An open-source software implementation of these two measures providing a complete procedure to test their significance would be extremely useful.
Findings
Here, we present MICtools, a comprehensive and effective pipeline that combines TICe and MICe into a multistep procedure that allows the identification of relationships of various degrees of complexity. MICtools calculates their strength assessing statistical significance using a permutation-based strategy. The performances of the proposed approach are assessed by an extensive investigation in synthetic datasets and an example of a potential application on a metagenomic dataset is also illustrated.
Conclusions
We show that MICtools, combining TICe and MICe, is able to highlight associations that would not be captured by conventional strategies.
Article activity feed
-
Background
**Reviewer 2: David Reshef **
General comments---This manuscript introduces an open-source implementation of two measures of dependence, MICe and TICe, which together provide a combination of both statistical power and equitability for identifying associations in large data sets. The implementation provided by the authors is a valuable contribution to the community that allows for the easy computation of these measures of dependence, and I'd recommend its acceptance after the authors make the minor edits listed below.Minor Comments---A few minor comments that the authors should be made aware of (but that I didn't want to be public given how minor they are):
- There are a few small type-o's to correct (e.g. coniugate on Pg. 1, line 31; expenses on pg. 2, line 15).
- I would suggest the authors soften the language around the fact …
Background
**Reviewer 2: David Reshef **
General comments---This manuscript introduces an open-source implementation of two measures of dependence, MICe and TICe, which together provide a combination of both statistical power and equitability for identifying associations in large data sets. The implementation provided by the authors is a valuable contribution to the community that allows for the easy computation of these measures of dependence, and I'd recommend its acceptance after the authors make the minor edits listed below.Minor Comments---A few minor comments that the authors should be made aware of (but that I didn't want to be public given how minor they are):
- There are a few small type-o's to correct (e.g. coniugate on Pg. 1, line 31; expenses on pg. 2, line 15).
- I would suggest the authors soften the language around the fact that "an implementation of these two measures and of a statistical procedure to test the significance of each association is still missing." The authors who developed MICe and TICe are simply waiting to post their implementation of MICe and TICe at www.exploredata.net along with the official publication of the most recent paper analyzing these measures in the Annals of Applied Statistics (https://www.e- publications.org/ims/submission/AOAS/user/submissionFile/29563?confirm=583655c8). That said, the implementation in this manuscript submitted to GigaScience is still a valuable contribution as it is open-source (the implementation AOAS will post is not) and provides a more comprehensive procedure to test for significance.
- On Pg. 1, line 31, "which coniugate computational efficiency with good bias/variance properties", isn't quite accurate. I'd change this to "which combine computational efficiency with superior bias/variance properties".
- On Pg. 2, line 5, "has been shown to satisfy the equitability requirement" should be changed to "has been shown to have good equitability" to reflect the fact that equitability is not a binary property, but a continuous one that a measure of dependence can have more or less of.
- On Pg. 2, line 6 - MIC doesn't actually suffer from lack of power, and this fact has been corrected in the literature, so I would recommend using softer language. It was shown in ref. 12 that was cited by the authors that the original perceived bad power of MIC was due to incorrect parameter settings by those who drew that conclusion. When used with appropriate parameters for independence testing, MIC has decent, but not state-of-the-art, power. What is accurate, however, is that MICe and TICe improved upon the power of MIC, and that TICe has state-of-the-art power.
- On Pg. 2, second column, line 23, regarding the sentence beginning with "With regards to the number of permutations..." (and elsewhere): the number of permutations necessary t operform for any given analysis scales with the number of tests one must correct for (i.e. the number of variable pairs for which a measure of dependence was computed), as the FDR accuracy is inversely proportional to the number of permutations used to compute it, so I'd be careful about saying that a specific number is generally enough for data of any dimensionality.
-
Abstract
A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giy032 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
These peer reviews were as follows:
**Reviewer 1: Simone Romano **
In this paper the authors describe and analyse a series of tools to find complex associations in large omics data sets. At the core of these tools lies the measure of association Maximal Information Coefficient (MIC) which recently received a lot of interest in data mining community. Other than presenting the first publicly available implementation of MIC to date, the authors make available the code for a complete pipeline to identify statistically significant associations between the features in a data set. This involves:- Computing the …
Abstract
A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giy032 ), where the paper and peer reviews are published openly under a CC-BY 4.0 license.
These peer reviews were as follows:
**Reviewer 1: Simone Romano **
In this paper the authors describe and analyse a series of tools to find complex associations in large omics data sets. At the core of these tools lies the measure of association Maximal Information Coefficient (MIC) which recently received a lot of interest in data mining community. Other than presenting the first publicly available implementation of MIC to date, the authors make available the code for a complete pipeline to identify statistically significant associations between the features in a data set. This involves:- Computing the Total Information Coefficient (TIC) for each pair of features- Computing their p-value using a permutation test with Monte Carlo simulations- Select the significant pairs using statistical correction for multiple hypotheses- Rank the statistically significant associations according to MICMoreover, the authors analyse the results of their pipeline on synthetic and real data sets.I commend the authors for providing the community with a well-tested implementation of MIC (and its more recent version MIC_e) in various programming languages including C, Matlab, and Python. I also really appreciate publishing a full pipeline to identify associations between features written in Python, which is probably the most popular language in the data science community. Moreover, the paper is well written and the analyses about the effectivity of these tools are convincing. The paper should be accepted for publication in the GigaScience journal. There has been so much discussion about the merit of MIC in the past years since its publication in 2011. I am honestly impressed by MIC's authors efforts to shed light on the theoretical and empirical properties of MIC.
Their effort recently found venue in prestigious journals such as the Proceedings of the National Academy of Science (PNAS) in 2014, the Journal of Machine Learning Research (JMLR) in 2016, and the Annals Of Applied Statistics (AOAS) in 2017. The main criticism about MIC has been its similarity to one of the many estimators of mutual information. Even though MIC exploits mutual information, MIC has been shown to not be the same as estimating mutual information [Measuring dependence powerfully and equitably by Reshef et al. in JMLR 2016]. Nonetheless, what strikes me the most is that: in many empirical studies no estimator of mutual information has the same performance of MIC in terms of equitability. Being equitability a very intuitive property, I do understand why researchers and data mining practitioners value MIC.I have only one concern about the methodology of screening associations with TIC and ranking only the selected ones with MIC. Possibly if we are interested just in equitability, MIC should be the only association measure to be employed in the analysis. However, given than TIC shows to have more power the MIC [An Empirical Study of the Maximal and Total Information Coefficients and Leading Measures of Dependence by Reshef et al. in AOAS 2017], I guess that the associations that MIC would deem as significant would be a subset of the significant associations for TIC.Minor comments:- It would be great to describe the Storey's method to control the FDR in the paper to make it self-contained; It would be also great to briefly describe the procedure to control the FWER; - A table describing the difference between the data sets SD1 and SD2 would be informative. Possibly a line describing the Madelon semi-synthetic data sets would be useful too;-
The authors discuss a great insight on MIC when they say that: "associations between informative/redundant and redundant/redundant variables were significant also for a lower number of samples". It would be nice to have a visual example about these type of associations;- Figure 4 b. I guess discussing a decreasing FN is the same as discussing increasing power. Changing the FN plot in a power plot would make the paper more coherent: e.g. as in Figure 2 a;- "coniugate" in the abstract -> conjugate. Maybe better to reformulate this sentence as it is not very clear; Simone Romano
-
-