Multivariate Mutual Information based Feature Selection for Predicting Histone Post-Translational Modifications in Epigenetic Datasets

Abstract

Mutual information (MI) has traditionally been employed in many areas, including biology, to identify non-linear relationships between features. This technique is particularly useful in the biological context for identifying features such as genes, histone post-translational modifications (PTMs), and transcription factors. In this work, instead of considering the conventional pairwise MI between PTM features, we evaluate multivariate mutual information (MMI) between PTM triplets to identify a set of outlier features. This enables us to form a small subset of PTMs that serves as the principal features for predicting the values of any histone PTM across the epigenome. We also compare the principal MMI features with those from traditional feature selection techniques such as PCA and Orthogonal Matching Pursuit. We predict all the remaining histone PTM intensities using XGBoost-based regression on the selected features. The accuracy of this technique is demonstrated on ChIP-seq datasets from the yeast and human epigenomes. The results indicate that the proposed MMI-based feature selection technique can serve as a useful method across various biological datasets.
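
For readers who want to see the shape of the pipeline the abstract describes, here is a minimal Python sketch. It is not the authors' implementation: the toy data, the quantile binning and bin count, the "most negative triplet" outlier rule, and the xgboost dependency are all illustrative assumptions; `mmi` computes the interaction-information form of MMI discussed in the reviews below.

```python
import numpy as np
from itertools import combinations
from xgboost import XGBRegressor

def entropy(*cols):
    """Shannon entropy (nats) of the joint distribution of discretized columns."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def mmi(x, y, z):
    """Interaction information I(X;Y;Z) = I(X;Y) - I(X;Y|Z), in entropy form."""
    return (entropy(x) + entropy(y) + entropy(z)
            - entropy(x, y) - entropy(x, z) - entropy(y, z)
            + entropy(x, y, z))

def discretize(v, bins=8):
    """Quantile-bin a continuous signal track (the bin count is an assumption)."""
    edges = np.quantile(v, np.linspace(0, 1, bins + 1)[1:-1])
    return np.digitize(v, edges)

# Toy stand-in for a (genomic bins x PTMs) ChIP-seq signal matrix.
rng = np.random.default_rng(0)
n_bins, n_ptms = 5000, 6
X = rng.gamma(2.0, 1.0, size=(n_bins, n_ptms))
D = np.column_stack([discretize(X[:, j]) for j in range(n_ptms)])

# Score every PTM triplet; treat the most negative MMI triplet as the
# "outlier" set (the paper's exact outlier criterion is not reproduced here).
scores = {t: mmi(D[:, t[0]], D[:, t[1]], D[:, t[2]])
          for t in combinations(range(n_ptms), 3)}
best = min(scores, key=scores.get)   # triplet with the most negative MMI
selected = list(best)                # principal PTM features

# Predict each remaining PTM from the selected ones with XGBoost regression.
for target in sorted(set(range(n_ptms)) - set(selected)):
    model = XGBRegressor(n_estimators=200, max_depth=4)
    model.fit(X[:, selected], X[:, target])
    print(target, model.score(X[:, selected], X[:, target]))  # in-sample R^2 on toy data
```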

Article activity feed

  1. Using significant negative MMI for the feature set seems to be a reasonable idea for building an independent feature set.

    One concern about only looking at negative MMI is that you will miss other types of interactions. For example, high positive MMI (or interaction information) is associated with 'common cause' effects and represents another important class of interactions. Did you consider including both cases, or using signed representations? (A signed-MMI sketch illustrating both cases follows this feed.)

  2. $I(X;Y;Z) = H(X) + H(Y) + H(Z) - H(X,Y) - H(X,Z) - H(Y,Z) + H(X,Y,Z)$ (2)

    Eq. (2) can be rewritten in terms of MI as

    $I(X;Y;Z) = I(X;Y) - I(X;Y|Z)$ (3)

    This generalization of MI to more than two variables is also known as interaction information, and it captures the unique information gained by knowing all three variables beyond what any subset of them provides. There are other extensions of MI to more than two variables. In particular, total correlation quantifies the shared information among all three variables, and as a result it has been used for feature selection. Did you consider using total correlation in this case? (A total-correlation sketch also follows this feed.)
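
To make the signed-MMI comment concrete, here is a hedged NumPy sketch (not from the paper) of why the sign matters under the convention of Eq. (3): a synergistic XOR-style triplet has negative interaction information, while a common-cause triplet has positive interaction information, so a rule that keeps only negative MMI would discard the latter class.

```python
import numpy as np

def entropy(*cols):
    """Shannon entropy (nats) of the joint distribution of the given columns."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def interaction_information(x, y, z):
    """McGill convention: I(X;Y;Z) = I(X;Y) - I(X;Y|Z), as in Eq. (3)."""
    return (entropy(x) + entropy(y) + entropy(z)
            - entropy(x, y) - entropy(x, z) - entropy(y, z)
            + entropy(x, y, z))

rng = np.random.default_rng(1)
n = 100_000
x = rng.integers(0, 2, n)
y = rng.integers(0, 2, n)

# Synergy ('XOR'): Z is informative only when X and Y are known jointly.
z_xor = x ^ y
print(interaction_information(x, y, z_xor))    # ~ -log(2): strongly negative

# Common cause: X and Y are noisy copies of Z, so conditioning on Z explains
# away their dependence; interaction information comes out positive.
z = rng.integers(0, 2, n)
noise = lambda: (rng.random(n) < 0.1).astype(np.int64)
x_cc, y_cc = z ^ noise(), z ^ noise()
print(interaction_information(x_cc, y_cc, z))  # positive
```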
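For the total-correlation comment, a similarly hedged sketch of the alternative score the reviewer suggests. Unlike interaction information, total correlation $TC(X,Y,Z) = H(X) + H(Y) + H(Z) - H(X,Y,Z)$ is non-negative and zero only under mutual independence, so it measures overall shared information rather than distinguishing synergy from redundancy; the toy variables below are assumptions for illustration.

```python
import numpy as np

def entropy(*cols):
    """Shannon entropy (nats) of the joint distribution of the given columns."""
    _, counts = np.unique(np.stack(cols, axis=1), axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def total_correlation(x, y, z):
    """TC(X,Y,Z) = H(X) + H(Y) + H(Z) - H(X,Y,Z); >= 0, zero iff independent."""
    return entropy(x) + entropy(y) + entropy(z) - entropy(x, y, z)

rng = np.random.default_rng(2)
n = 100_000
z = rng.integers(0, 2, n)
noise = lambda: (rng.random(n) < 0.1).astype(np.int64)

# Three noisy copies of a common source share substantial information ...
print(total_correlation(z ^ noise(), z ^ noise(), z ^ noise()))  # clearly > 0
# ... while mutually independent variables share essentially none.
print(total_correlation(rng.integers(0, 2, n),
                        rng.integers(0, 2, n),
                        rng.integers(0, 2, n)))                  # ~ 0
```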