Chemical Dice Integrator (CDI): A Scalable Framework for Multimodal Molecular Representation Learning

Suvendu Kumar
Saveena Solanki
Mudit Gupta
Sanjay Kumar Mohanty
Shiva Satija
Sonam Chauhan
Subhadeep Duari
Arushi Sharma
Vishakha Gautam
Sakshi Arora
Raidhani Shome
Sourav Sinha
Abhinav Kumar Sharma
Aayushi Mittal
Debarka Sengupta
Natarajan Arul Murugan
Gaurav Ahuja


Abstract

The machine learning landscape for molecular property prediction is fragmented: numerous featurizers each capture a narrow, specialized view of chemical structure. This heterogeneity forces a representation to be chosen a priori, a choice that is often suboptimal and limits model generalizability. We introduce the Chemical Dice Integrator (CDI), a hierarchical framework that unifies six orthogonal molecular representations into a single, coherent embedding: physicochemical (Mordred), topological (GROVER), visual (ImageMol), biological (Signaturizer), quantum-mechanical (MOPAC), and linguistic (ChemBERTa). The framework consists of CDI-Basic, a two-tiered autoencoder that fuses these modalities, and CDI-Generalised, a Mamba state-space model (SSM) that learns a direct, efficient map from SMILES strings to the unified embedding space. Extensive benchmarking across 23 classification datasets (171 tasks) and 10 regression datasets demonstrates that CDI embeddings consistently achieve superior predictive performance compared to individual featurizers and standard feature aggregation methods. CDI-Generalised achieves this performance with exceptional computational efficiency, outperforming deep learning featurizers in speed and resource overhead. Furthermore, we demonstrate that the CDI embedding is chemically intuitive: it sensitively distinguishes nuanced structural variants, such as chiral enantiomers and kekulized SMILES forms. By bridging multimodal chemical intelligence with scalable, sequence-based inference, CDI offers a strong foundation for molecular machine learning.
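
The two-tier fusion idea behind CDI-Basic can be illustrated with a minimal sketch. The abstract does not specify the architecture, so everything here is an assumption: the tier arrangement (one autoencoder per modality, then one autoencoder over the concatenated codes), the feature widths, the code sizes, and the synthetic random features standing in for real featurizer outputs. Linear autoencoders fitted in closed form by truncated SVD are swapped in for the trained nonlinear networks a real system would presumably use.

```python
import numpy as np

def fit_linear_autoencoder(X, code_dim):
    """Fit a linear autoencoder in closed form (equivalent to PCA).

    Returns (mean, W), where encoding is (x - mean) @ W and decoding
    is z @ W.T + mean.
    """
    mean = X.mean(axis=0)
    # Rows of Vt are principal directions; keeping the first code_dim of them
    # minimises squared reconstruction error among linear codes of that size.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:code_dim].T

def encode(X, model):
    mean, W = model
    return (X - mean) @ W

def decode(Z, model):
    mean, W = model
    return Z @ W.T + mean

rng = np.random.default_rng(0)
n_molecules = 200

# Hypothetical feature widths for the six modalities (not taken from the paper).
modality_dims = {
    "physicochemical": 40,
    "topological": 64,
    "visual": 48,
    "biological": 32,
    "quantum_mechanical": 24,
    "linguistic": 56,
}
# Synthetic placeholder features; real inputs would come from the featurizers.
features = {name: rng.normal(size=(n_molecules, d))
            for name, d in modality_dims.items()}

# Tier 1 (assumed): compress each modality to a common code size.
tier1_dim = 16
tier1 = {name: fit_linear_autoencoder(X, tier1_dim)
         for name, X in features.items()}
codes = np.hstack([encode(features[name], tier1[name])
                   for name in modality_dims])

# Tier 2 (assumed): fuse the concatenated codes into one unified embedding.
fused_dim = 32
tier2 = fit_linear_autoencoder(codes, fused_dim)
embedding = encode(codes, tier2)

print(codes.shape, embedding.shape)
```

Under these assumptions, each molecule ends up with a single fixed-length vector regardless of how many modalities contributed to it, which is the property that lets a sequence model such as CDI-Generalised be trained to predict that vector directly from SMILES.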

Version published to 10.1101/2025.11.11.687860 on bioRxiv
Nov 13, 2025

Related articles

Integrating Evolutionary and Compositional Features with ML and DL for Robust and Interpretable Druggable Protein Prediction

This article has 5 authors:
1. Mujeebu Rehman
2. Qinghua Liu
3. Muhammad Javed
4. Ali Ghulam
5. Teerath Kumar
This article has no evaluations. Latest version: Dec 11, 2025

Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES

This article has 2 authors:
1. Oliver Goldstein
2. Samuel March
This article has no evaluations. Latest version: Jan 27, 2026

Nuclear-Charge-Guided Mamba with KAN Dynamic Mixture for Molecular Property Prediction

This article has 1 author:
1. Hong Wang
This article has no evaluations. Latest version: Dec 30, 2025
