Representing Molecules with Algebraic Data Types: Beyond SMILES and SELFIES
Abstract
Benchmarks of molecular machine learning models often treat the molecular representation as a neutral input format, yet the representation defines the syntax of validity, edit operations, and invariances that models implicitly learn. We propose MolADT, a typed intermediate representation (IR) for molecules expressed as a family of algebraic data types that separates (i) constitution via Dietz-style bonding systems, (ii) 3D geometry and stereochemistry, and (iii) optional electronic annotations. By shifting from string edits to operations over structured values, MolADT makes representational assumptions explicit, supports deterministic validation and localized transformations, and provides hooks for symmetry-aware and Bayesian workflows. We provide a reference implementation in Haskell (open-source, archived with DOI) and worked examples demonstrating delocalised/multicentre bonding, validation invariants, reaction extensions, and group actions relevant to geometric learning.

Scientific Contribution: We (1) introduce a representation-level framework that treats molecular representations as well-defined syntactic contracts rather than serializations, (2) formalize a layered typed IR capturing constitution/geometry/annotations, and (3) provide an open reference implementation intended to enable more controlled and interpretable benchmarking of molecular ML pipelines.
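To make the layered design concrete, the following is a minimal Haskell sketch of what such a typed IR could look like. It is not the MolADT reference implementation: every type, field, and function name here (`BondingSystem`, `Molecule`, `validate`, `benzeneRing`, and so on) is a hypothetical choice made for illustration. It assumes that a Dietz-style bonding system can be modelled as a count of shared electrons spread over a set of atom pairs, and it omits stereochemistry, hydrogens, and any chemical valence rules.

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

-- All names below are hypothetical; they are not the MolADT API.
newtype AtomId = AtomId Int deriving (Eq, Ord, Show)

data Element = H | C | N | O deriving (Eq, Show)

-- An unordered atom pair, stored with the smaller id first.
data Edge = Edge AtomId AtomId deriving (Eq, Ord, Show)

mkEdge :: AtomId -> AtomId -> Edge
mkEdge a b = if a <= b then Edge a b else Edge b a

-- Layer (i), constitution: a bonding system shares some number of
-- electrons over a set of edges (assumed reading of the Dietz scheme).
data BondingSystem = BondingSystem
  { sharedElectrons :: Int
  , memberEdges     :: Set.Set Edge
  } deriving (Eq, Show)

-- Layer (ii), geometry: Cartesian coordinates only in this sketch.
data Coord = Coord Double Double Double deriving (Eq, Show)

-- Layer (iii), optional electronic annotations.
data Annotation = FormalCharge Int deriving (Eq, Show)

data Molecule = Molecule
  { atoms          :: Map.Map AtomId Element
  , bondingSystems :: [BondingSystem]
  , geometry       :: Maybe (Map.Map AtomId Coord)
  , annotations    :: Map.Map AtomId Annotation
  } deriving (Eq, Show)

data ValidationError
  = DanglingAtom AtomId
  | SelfLoop AtomId
  | NonPositiveElectrons Int
  | MissingCoord AtomId
  deriving (Eq, Show)

-- Structural checks only; the result order is deterministic because
-- sets and maps are traversed in key order.
validate :: Molecule -> [ValidationError]
validate m = concatMap checkSystem (bondingSystems m) ++ coordErrors
  where
    known a = Map.member a (atoms m)
    checkSystem bs =
      [ NonPositiveElectrons (sharedElectrons bs) | sharedElectrons bs <= 0 ]
        ++ concatMap checkEdge (Set.toList (memberEdges bs))
    checkEdge (Edge a b) =
      [ SelfLoop a | a == b ]
        ++ [ DanglingAtom x | x <- [a, b], not (known x) ]
    coordErrors = case geometry m of
      Nothing -> []
      Just cs -> [ MissingCoord a | a <- Map.keys (atoms m), not (Map.member a cs) ]

-- Carbon skeleton of benzene: six two-electron sigma systems plus one
-- six-electron system delocalised over all six ring edges.
benzeneRing :: Molecule
benzeneRing = Molecule
  { atoms          = Map.fromList [ (AtomId i, C) | i <- [1 .. 6] ]
  , bondingSystems = piSystem : sigmaBonds
  , geometry       = Nothing
  , annotations    = Map.empty
  }
  where
    ringEdges  = [ mkEdge (AtomId i) (AtomId (i `mod` 6 + 1)) | i <- [1 .. 6] ]
    sigmaBonds = [ BondingSystem 2 (Set.singleton e) | e <- ringEdges ]
    piSystem   = BondingSystem 6 (Set.fromList ringEdges)
```

In this sketch `validate benzeneRing` returns an empty list, while a bonding system that mentions an atom id absent from `atoms` yields a `DanglingAtom` error. Keeping geometry as a `Maybe` and annotations as a separate map is one way to leave the constitution layer usable on its own, in the spirit of the layering the abstract describes.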