A practical tool for maximal information coefficient analysis

Davide Albanese
Samantha Riccadonna
Claudio Donati
Pietro Franceschi

Abstract

Background

The ability to find complex associations in large omics datasets, assess their significance, and prioritize them according to their strength can be of great help in the data exploration phase. Mutual information-based measures of association are particularly promising, especially after the recent introduction of the TICe and MICe estimators, which combine computational efficiency with superior bias/variance properties. An open-source software implementation of these two measures providing a complete procedure to test their significance would be extremely useful.

Findings

Here, we present MICtools, a comprehensive and effective pipeline that combines TICe and MICe into a multistep procedure that allows the identification of relationships of various degrees of complexity. MICtools calculates their strength and assesses their statistical significance using a permutation-based strategy. The performance of the proposed approach is assessed by an extensive investigation on synthetic datasets, and an example of a potential application to a metagenomic dataset is also illustrated.

Conclusions

We show that MICtools, combining TICe and MICe, is able to highlight associations that would not be captured by conventional strategies.
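The multistep procedure summarized above (score every pair of variables, obtain a permutation p-value, correct for multiple testing, rank the survivors by strength) can be sketched in a few lines. This is only an illustration under stated assumptions, not the MICtools implementation: absolute Pearson correlation stands in for both TICe (screening) and MICe (ranking), a Bonferroni threshold stands in for the multiple-testing correction, and the names `abs_pearson`, `permutation_pvalue`, and `screen_and_rank` are hypothetical.

```python
import itertools
import math
import random


def abs_pearson(x, y):
    """Stand-in association score (NOT TICe or MICe): absolute Pearson r."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, score, n_perm, rng):
    """Empirical p-value: share of shuffled-y scores >= the observed score,
    with the add-one correction so the estimate is never exactly zero."""
    observed = score(x, y)
    y_perm = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(y_perm)
        if score(x, y_perm) >= observed:
            hits += 1
    return (1 + hits) / (1 + n_perm)


def screen_and_rank(features, screen_score=abs_pearson, rank_score=abs_pearson,
                    n_perm=999, alpha=0.05, seed=0):
    """Screen all pairs with screen_score, keep the significant ones, and
    rank them by rank_score. features maps a name to a list of values.
    Returns (name_a, name_b, strength, p_value) tuples, strongest first."""
    rng = random.Random(seed)
    pairs = list(itertools.combinations(sorted(features), 2))
    if not pairs:
        return []
    threshold = alpha / len(pairs)  # Bonferroni: an assumed, simple correction
    kept = []
    for a, b in pairs:
        p = permutation_pvalue(features[a], features[b], screen_score, n_perm, rng)
        if p <= threshold:
            kept.append((a, b, rank_score(features[a], features[b]), p))
    return sorted(kept, key=lambda item: item[2], reverse=True)


# Toy data: y depends on x, z is independent noise.
demo_rng = random.Random(42)
x = [demo_rng.gauss(0, 1) for _ in range(50)]
demo = {
    "x": x,
    "y": [2 * v + demo_rng.gauss(0, 0.1) for v in x],
    "z": [demo_rng.gauss(0, 1) for _ in range(50)],
}
for name_a, name_b, strength, p in screen_and_rank(demo):
    print(name_a, name_b, round(strength, 3), p)
```

Keeping the screening score and the ranking score as separate arguments mirrors the two-measure design: a powerful statistic decides which pairs are significant, and a different statistic can then order them by strength.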

GigaScience
Jun 23, 2022
Background

**Reviewer 2: David Reshef**

General comments---This manuscript introduces an open-source implementation of two measures of dependence, MICe and TICe, which together provide a combination of both statistical power and equitability for identifying associations in large data sets. The implementation provided by the authors is a valuable contribution to the community that allows for the easy computation of these measures of dependence, and I'd recommend its acceptance after the authors make the minor edits listed below.

Minor Comments---A few minor comments that the authors should be made aware of (but that I didn't want to be public given how minor they are):

There are a few small typos to correct (e.g. coniugate on Pg. 1, line 31; expenses on pg. 2, line 15).

I would suggest the authors soften the language around the fact that "an implementation of these two measures and of a statistical procedure to test the significance of each association is still missing." The authors who developed MICe and TICe are simply waiting to post their implementation of MICe and TICe at www.exploredata.net along with the official publication of the most recent paper analyzing these measures in the Annals of Applied Statistics (https://www.e-publications.org/ims/submission/AOAS/user/submissionFile/29563?confirm=583655c8). That said, the implementation in this manuscript submitted to GigaScience is still a valuable contribution as it is open-source (the implementation AOAS will post is not) and provides a more comprehensive procedure to test for significance.

On Pg. 1, line 31, "which coniugate computational efficiency with good bias/variance properties", isn't quite accurate. I'd change this to "which combine computational efficiency with superior bias/variance properties".

On Pg. 2, line 5, "has been shown to satisfy the equitability requirement" should be changed to "has been shown to have good equitability" to reflect the fact that equitability is not a binary property, but a continuous one that a measure of dependence can have more or less of.

On Pg. 2, line 6 - MIC doesn't actually suffer from lack of power, and this fact has been corrected in the literature, so I would recommend using softer language. It was shown in ref. 12 that was cited by the authors that the original perceived bad power of MIC was due to incorrect parameter settings by those who drew that conclusion. When used with appropriate parameters for independence testing, MIC has decent, but not state-of-the-art, power. What is accurate, however, is that MICe and TICe improved upon the power of MIC, and that TICe has state-of-the-art power.

On Pg. 2, second column, line 23, regarding the sentence beginning with "With regards to the number of permutations..." (and elsewhere): the number of permutations necessary to perform for any given analysis scales with the number of tests one must correct for (i.e. the number of variable pairs for which a measure of dependence was computed), as the FDR accuracy is inversely proportional to the number of permutations used to compute it, so I'd be careful about saying that a specific number is generally enough for data of any dimensionality.
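The scaling argument in this last comment can be made concrete with a little arithmetic. The sketch below assumes the common add-one estimator for an empirical p-value, p = (1 + k) / (1 + B) with B permutations and k permuted scores at least as large as the observed one, and a Bonferroni-style per-test threshold alpha / m for m tests. Both are illustrative assumptions rather than what MICtools necessarily does, and the function names are hypothetical. With an FDR procedure the exact requirement differs, but the same resolution limit applies: no p-value can be resolved below 1 / (1 + B).

```python
import math
from fractions import Fraction


def smallest_pvalue(n_perm):
    """Smallest empirical p-value attainable with n_perm permutations,
    i.e. the add-one estimator (1 + k) / (1 + n_perm) with k = 0."""
    return 1.0 / (1 + n_perm)


def min_permutations(n_tests, alpha=0.05):
    """Fewest permutations B such that 1 / (1 + B) <= alpha / n_tests,
    the assumed Bonferroni-style per-test threshold."""
    # Exact rational arithmetic avoids floating-point surprises in ceil().
    a = Fraction(alpha).limit_denominator(10**6)
    return math.ceil(n_tests / a) - 1


# The required number of permutations grows linearly with the number of pairs.
for n_variables in (10, 100, 1000):
    n_pairs = n_variables * (n_variables - 1) // 2
    print(n_variables, n_pairs, min_permutations(n_pairs))
```

Because the number of variable pairs grows quadratically with the number of variables, a permutation count that is ample for a small dataset can leave the smallest attainable p-value above the corrected threshold for a large one.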

Abstract

A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giy032), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

These peer reviews were as follows:

**Reviewer 1: Simone Romano**

In this paper the authors describe and analyse a series of tools to find complex associations in large omics data sets. At the core of these tools lies the measure of association Maximal Information Coefficient (MIC), which recently received a lot of interest in the data mining community. Other than presenting the first publicly available implementation of MIC to date, the authors make available the code for a complete pipeline to identify statistically significant associations between the features in a data set. This involves:

- Computing the Total Information Coefficient (TIC) for each pair of features
- Computing their p-value using a permutation test with Monte Carlo simulations
- Selecting the significant pairs using statistical correction for multiple hypotheses
- Ranking the statistically significant associations according to MIC

Moreover, the authors analyse the results of their pipeline on synthetic and real data sets.

I commend the authors for providing the community with a well-tested implementation of MIC (and its more recent version MIC_e) in various programming languages including C, Matlab, and Python. I also really appreciate publishing a full pipeline to identify associations between features written in Python, which is probably the most popular language in the data science community. Moreover, the paper is well written and the analyses about the effectiveness of these tools are convincing. The paper should be accepted for publication in the GigaScience journal. There has been so much discussion about the merit of MIC in the past years since its publication in 2011. I am honestly impressed by the MIC authors' efforts to shed light on the theoretical and empirical properties of MIC.

Their effort recently found a venue in prestigious journals such as the Proceedings of the National Academy of Sciences (PNAS) in 2014, the Journal of Machine Learning Research (JMLR) in 2016, and the Annals of Applied Statistics (AOAS) in 2017. The main criticism about MIC has been its similarity to one of the many estimators of mutual information. Even though MIC exploits mutual information, MIC has been shown to not be the same as estimating mutual information [Measuring dependence powerfully and equitably by Reshef et al. in JMLR 2016]. Nonetheless, what strikes me the most is that in many empirical studies no estimator of mutual information has the same performance as MIC in terms of equitability. Since equitability is a very intuitive property, I do understand why researchers and data mining practitioners value MIC.

I have only one concern about the methodology of screening associations with TIC and ranking only the selected ones with MIC. Possibly if we are interested just in equitability, MIC should be the only association measure to be employed in the analysis. However, given that TIC has been shown to have more power than MIC [An Empirical Study of the Maximal and Total Information Coefficients and Leading Measures of Dependence by Reshef et al. in AOAS 2017], I guess that the associations that MIC would deem as significant would be a subset of the significant associations for TIC.

Minor comments:

- It would be great to describe Storey's method to control the FDR in the paper to make it self-contained. It would also be great to briefly describe the procedure to control the FWER.
- A table describing the difference between the data sets SD1 and SD2 would be informative. Possibly a line describing the Madelon semi-synthetic data sets would be useful too.
- The authors discuss a great insight on MIC when they say that: "associations between informative/redundant and redundant/redundant variables were significant also for a lower number of samples". It would be nice to have a visual example of these types of associations.
- Figure 4 b. I guess discussing a decreasing FN is the same as discussing increasing power. Changing the FN plot into a power plot would make the paper more coherent: e.g. as in Figure 2 a.
- "coniugate" in the abstract -> conjugate. Maybe better to reformulate this sentence as it is not very clear.

Simone Romano
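The first minor comment asks for a description of Storey's method for FDR control. As a rough illustration, the sketch below computes q-values from a list of p-values using a simplified form of that approach: the proportion of true null hypotheses, pi0, is estimated with a single fixed tuning parameter `lam` (an assumption made here for brevity; other ways of choosing it exist), and each q-value is the smallest estimated FDR at which that test would be called significant. The function name `storey_qvalues` is hypothetical, and this is not taken from the MICtools code.

```python
def storey_qvalues(pvalues, lam=0.5):
    """Simplified Storey-style q-values with a fixed tuning parameter lam.

    pi0 (share of true nulls) is estimated from the p-values above lam,
    which are assumed to come mostly from null hypotheses.
    """
    m = len(pvalues)
    if m == 0:
        return []
    pi0 = min(1.0, sum(p > lam for p in pvalues) / (m * (1.0 - lam)))
    # Walk from the largest p-value down, enforcing monotone q-values.
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pi0 * m * pvalues[i] / rank)
        qvalues[i] = running_min
    return qvalues


pvals = [0.001, 0.01, 0.02, 0.6, 0.8, 0.9]
for p, q in zip(pvals, storey_qvalues(pvals)):
    print(p, round(q, 4))
```

Pairs whose q-value falls below the chosen FDR level would then be retained. Note that with a fixed `lam` and very few tests the pi0 estimate is noisy, and it collapses to zero when no p-value exceeds `lam`, so this simplified version is only meaningful when many pairs are tested.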

Version published to 10.1093/gigascience/giy032
Apr 1, 2018
Version published to 10.1101/215855 on bioRxiv
Nov 7, 2017
