Chemical Dice Integrator (CDI): A Scalable Framework for Multimodal Molecular Representation Learning

Abstract

The machine learning landscape for molecular property prediction is fragmented: numerous featurizers each capture a narrow, specialized view of chemical structure. This heterogeneity forces a choice of representation a priori, which is often suboptimal and limits model generalizability. We introduce the Chemical Dice Integrator (CDI), a hierarchical framework that unifies six orthogonal molecular representations into a single, coherent embedding: physicochemical (Mordred), topological (GROVER), visual (ImageMol), biological (Signaturizer), quantum-mechanical (MOPAC), and linguistic (ChemBERTa). The framework consists of CDI-Basic, a two-tiered autoencoder that fuses these modalities, and CDI-Generalised, a Mamba State-Space Model (SSM) that learns a direct, efficient map from SMILES strings to the unified embedding space. Extensive benchmarking across 23 classification datasets (171 tasks) and 10 regression datasets demonstrates that CDI embeddings consistently achieve superior predictive performance compared to individual featurizers and standard feature aggregation methods. The CDI-Generalised model achieves this performance with exceptional computational efficiency, outperforming deep learning featurizers in speed and resource overhead. Furthermore, we demonstrate that the CDI embedding is chemically intuitive: it sensitively distinguishes nuanced structural variants, such as chiral enantiomers and kekulized SMILES forms. By bridging multimodal chemical intelligence with scalable, sequence-based inference, CDI offers a strong foundation for molecular machine learning.
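
To make the data flow of a two-tiered fusion concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the first tier compresses each modality separately and that the second tier fuses the concatenated latents; the abstract does not confirm this split. All dimensions and the names `MODALITY_DIMS`, `TIER1_DIM`, `TIER2_DIM`, `build_encoders`, and `fuse` are hypothetical, the weights are random rather than trained, and only the encoder halves are shown (a real autoencoder would add decoders and a reconstruction loss).

```python
import random

# Hypothetical input widths per modality; the abstract does not state them.
MODALITY_DIMS = {
    "Mordred": 16,
    "GROVER": 12,
    "ImageMol": 10,
    "Signaturizer": 8,
    "MOPAC": 6,
    "ChemBERTa": 14,
}
TIER1_DIM = 4  # assumed per-modality latent size
TIER2_DIM = 8  # assumed unified embedding size


def make_linear(n_in, n_out, rng):
    """Return an untrained weight matrix with n_out rows and n_in columns."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


def encode(weights, vector):
    """Apply a linear map followed by ReLU."""
    return [max(0.0, sum(w * x for w, x in zip(row, vector))) for row in weights]


def build_encoders(seed=0):
    """Create one tier-1 encoder per modality and a single tier-2 fusion encoder."""
    rng = random.Random(seed)
    tier1 = {
        name: make_linear(dim, TIER1_DIM, rng)
        for name, dim in MODALITY_DIMS.items()
    }
    tier2 = make_linear(TIER1_DIM * len(MODALITY_DIMS), TIER2_DIM, rng)
    return tier1, tier2


def fuse(features, tier1, tier2):
    """Tier 1 compresses each modality; tier 2 fuses the concatenated latents."""
    latents = []
    for name in MODALITY_DIMS:  # fixed order keeps the concatenation stable
        latents.extend(encode(tier1[name], features[name]))
    return encode(tier2, latents)


if __name__ == "__main__":
    rng = random.Random(1)
    tier1, tier2 = build_encoders()
    # Stand-in feature vectors; real inputs would come from the six featurizers.
    molecule = {
        name: [rng.random() for _ in range(dim)]
        for name, dim in MODALITY_DIMS.items()
    }
    embedding = fuse(molecule, tier1, tier2)
    print(len(embedding))
```

In this sketch, every molecule ends up as one fixed-length vector regardless of how wide each input modality is, which is the property a downstream classifier or regressor would rely on.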
