Benchmarking Molecular Representations for Aqueous Solubility Prediction: The Impact of Inductive Bias and Scaffold Splitting in Low-Data Regimes


Abstract

Accurate prediction of aqueous solubility (logS) is a critical bottleneck in the early stages of drug discovery and formulation. While Graph Neural Networks (GNNs) have emerged as state-of-the-art architectures for molecular property prediction, their efficacy compared to classical feature engineering remains contested in low-data regimes. In this study, we perform a rigorous comparative analysis of three molecular representation strategies (explicit physicochemical descriptors, high-dimensional Morgan fingerprints, and end-to-end graph embeddings) evaluated on the ESOL dataset (N = 1,128). To simulate realistic prospective evaluation, we employ a Murcko scaffold split, ensuring that the test set contains novel chemotypes distinct from the training distribution. Our results demonstrate that a Multi-Layer Perceptron (MLP) trained on domain-specific descriptors (e.g., LogP, Molecular Weight) achieves superior performance (MAE = 0.73, R² = 0.75), significantly outperforming both Morgan fingerprints (R² = 0.14) and Graph Convolutional Networks (R² = 0.54). This suggests that for small-scale datasets, the inductive bias provided by explicit physical features outweighs the representation learning capabilities of GNNs. Furthermore, we implement a Deep Ensemble framework to quantify predictive uncertainty. We find a strong correlation between ensemble variance and prediction error, validating the use of uncertainty estimation as a reliability filter for out-of-domain screening. These findings advocate for a "physics-first" approach when applying deep learning to small, sparse chemical datasets.
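The two evaluation ideas named in the abstract, a scaffold-grouped split and ensemble variance as an uncertainty signal, can be illustrated with a minimal sketch. This is not the authors' code. It assumes that a scaffold identifier per molecule has already been computed (for example, Bemis-Murcko scaffold SMILES from a cheminformatics toolkit such as RDKit), and that each ensemble member returns a single point prediction of logS. The function names `scaffold_split` and `ensemble_mean_and_variance`, the largest-group-first ordering, and the toy values are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance


def scaffold_split(scaffolds, frac_train=0.8):
    """Split sample indices so no scaffold occurs in both train and test.

    `scaffolds` holds one precomputed scaffold identifier per molecule
    (assumption: e.g. Bemis-Murcko scaffold SMILES strings).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    # Visit the largest scaffold groups first (ties broken by first index),
    # so that rarer chemotypes tend to land in the test set.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    cutoff = frac_train * len(scaffolds)
    train, test = [], []
    for group in ordered:
        # A group is assigned as a whole; it is never divided across sets.
        if len(train) + len(group) <= cutoff:
            train.extend(group)
        else:
            test.extend(group)
    return train, test


def ensemble_mean_and_variance(member_predictions):
    """Per-molecule mean and population variance across ensemble members.

    `member_predictions[m][i]` is member m's predicted logS for molecule i.
    """
    per_molecule = list(zip(*member_predictions))
    means = [mean(p) for p in per_molecule]
    variances = [pvariance(p) for p in per_molecule]
    return means, variances


# Toy data (hypothetical): three molecules share one scaffold, two are unique.
toy_scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccncc1", "C1CCCCC1"]
train_idx, test_idx = scaffold_split(toy_scaffolds, frac_train=0.7)

# Three ensemble members, two molecules; members disagree more on the second.
toy_preds = [[-2.0, -4.0], [-2.2, -3.0], [-1.8, -5.0]]
pred_mean, pred_var = ensemble_mean_and_variance(toy_preds)
```

In a screening setting, predictions whose variance exceeds a chosen threshold could be flagged as unreliable. The threshold would need to be calibrated on held-out data, and the abstract does not state how the authors set it.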
