Benchmarking Molecular Representations for Aqueous Solubility Prediction: The Impact of Inductive Bias and Scaffold Splitting in Low-Data Regimes
Abstract
Accurate prediction of aqueous solubility (logS) is a critical bottleneck in the early stages of drug discovery and formulation. While Graph Neural Networks (GNNs) have emerged as state-of-the-art architectures for molecular property prediction, their efficacy compared to classical feature engineering remains contested in low-data regimes. In this study, we perform a rigorous comparative analysis of three molecular representation strategies (explicit physicochemical descriptors, high-dimensional Morgan fingerprints, and end-to-end graph embeddings), evaluated on the ESOL dataset (N = 1,128). To simulate realistic prospective evaluation, we employ a Murcko scaffold split, ensuring that the test set contains novel chemotypes distinct from the training distribution. Our results demonstrate that a Multi-Layer Perceptron (MLP) trained on domain-specific descriptors (e.g., LogP, molecular weight) achieves superior performance (MAE = 0.73, R² = 0.75), significantly outperforming both Morgan fingerprints (R² = 0.14) and Graph Convolutional Networks (R² = 0.54). This suggests that for small-scale datasets, the inductive bias provided by explicit physical features outweighs the representation learning capabilities of GNNs. Furthermore, we implement a Deep Ensemble framework to quantify predictive uncertainty. We find a strong correlation between ensemble variance and prediction error, validating the use of uncertainty estimation as a reliability filter for out-of-domain screening. These findings advocate for a "physics-first" approach when applying deep learning to small, sparse chemical datasets.
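To make the scaffold-split idea concrete, the following is a minimal sketch of a scaffold-grouped train/test split. It is not the authors' implementation: the function name `scaffold_split`, the `test_fraction` parameter, and the largest-groups-first assignment heuristic are all assumptions chosen for illustration. The sketch also assumes that a scaffold string has already been computed for each molecule (for example with RDKit's `MurckoScaffold.MurckoScaffoldSmiles`), so it depends only on the Python standard library.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def scaffold_split(
    scaffolds: Sequence[Hashable],
    test_fraction: float = 0.2,
) -> Tuple[List[int], List[int]]:
    """Split molecule indices so no scaffold occurs in both train and test.

    `scaffolds[i]` is the (precomputed) scaffold key of molecule i.
    This is a hypothetical helper, not the procedure used in the paper.
    """
    # Group molecule indices by their scaffold key.
    groups: Dict[Hashable, List[int]] = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Largest groups first; ties are broken by first index so the
    # result is deterministic.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    # Fill the training set with whole groups until it is full; every
    # group that does not fit goes to the test set in its entirety.
    train_cutoff = (1.0 - test_fraction) * len(scaffolds)
    train: List[int] = []
    test: List[int] = []
    for group in ordered:
        if len(train) + len(group) <= train_cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


# Toy example with made-up scaffold keys for eight molecules.
demo_scaffolds = ["A", "A", "A", "A", "B", "B", "C", "D"]
demo_train, demo_test = scaffold_split(demo_scaffolds, test_fraction=0.25)
```

Because whole scaffold groups are assigned to one side, rare scaffolds tend to land in the test set, which is what makes this kind of split a harder, more prospective-style evaluation than a random split.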