Benchmarking Molecular Representations for Aqueous Solubility Prediction: The Impact of Inductive Bias and Scaffold Splitting in Low-Data Regimes

Mudassir Ur Rahman

Abstract

Accurate prediction of aqueous solubility (logS) is a critical bottleneck in the early stages of drug discovery and formulation. While Graph Neural Networks (GNNs) have emerged as state-of-the-art architectures for molecular property prediction, their efficacy compared to classical feature engineering remains contested in low-data regimes. In this study, we perform a rigorous comparative analysis of three molecular representation strategies (explicit physicochemical descriptors, high-dimensional Morgan fingerprints, and end-to-end graph embeddings), evaluated on the ESOL dataset (N = 1,128). To simulate realistic prospective evaluation, we employ a Murcko scaffold split, ensuring that the test set contains novel chemotypes distinct from the training distribution. Our results demonstrate that a Multi-Layer Perceptron (MLP) trained on domain-specific descriptors (e.g., LogP, Molecular Weight) achieves superior performance (MAE = 0.73, R² = 0.75), significantly outperforming both Morgan fingerprints (R² = 0.14) and Graph Convolutional Networks (R² = 0.54). This suggests that for small-scale datasets, the inductive bias provided by explicit physical features outweighs the representation learning capabilities of GNNs. Furthermore, we implement a Deep Ensemble framework to quantify predictive uncertainty. We find a strong correlation between ensemble variance and prediction error, validating the use of uncertainty estimation as a reliability filter for out-of-domain screening. These findings advocate for a "physics-first" approach when applying deep learning to small, sparse chemical datasets.
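
The scaffold split mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each molecule already has a scaffold key (which could be computed, for example, with RDKit's Bemis-Murcko scaffold utilities), and the function name `scaffold_split`, the `test_fraction` parameter, and the toy scaffold strings are all hypothetical. Whole scaffold groups are assigned to one side of the split, so no scaffold is shared between training and test sets.

```python
from collections import defaultdict


def scaffold_split(scaffolds, test_fraction=0.2):
    """Split molecule indices so that no scaffold key appears in both sets.

    scaffolds: one scaffold key per molecule (assumed to be precomputed).
    Returns (train_indices, test_indices), each sorted.
    """
    # Group molecule indices by their scaffold key.
    groups = defaultdict(list)
    for index, scaffold in enumerate(scaffolds):
        groups[scaffold].append(index)

    # Largest groups first; ties are broken by first index for determinism.
    ordered = sorted(groups.values(), key=lambda g: (-len(g), g[0]))

    # Integer training budget, so group sizes are compared without float error.
    n_total = len(scaffolds)
    train_budget = n_total - round(test_fraction * n_total)

    train, test = [], []
    for group in ordered:
        # A group goes to training only if it fits entirely within the budget;
        # otherwise the whole group goes to the test set.
        if len(train) + len(group) <= train_budget:
            train.extend(group)
        else:
            test.extend(group)
    return sorted(train), sorted(test)


# Toy scaffold keys (hypothetical values, written as SMILES-like strings).
toy_scaffolds = [
    "c1ccccc1", "c1ccccc1", "c1ccccc1", "c1ccccc1",
    "c1ccncc1", "c1ccncc1", "c1ccncc1",
    "C1CCCCC1", "C1CCCCC1",
    "c1ccoc1",
]
train_idx, test_idx = scaffold_split(toy_scaffolds, test_fraction=0.3)
train_scaffolds = {toy_scaffolds[i] for i in train_idx}
test_scaffolds = {toy_scaffolds[i] for i in test_idx}
```

In this toy run the two largest scaffold groups fill the training set and the two smallest groups form the test set, so `train_scaffolds` and `test_scaffolds` are disjoint. Placing the largest groups in training is one common convention; it tends to leave rarer scaffolds for testing, which makes the evaluation harder than a random split.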

Version published to 10.21203/rs.3.rs-9059650/v1 on Research Square
Mar 23, 2026

Related articles

Deep Learning Foundation Models from Classical Molecular Descriptors

This article has 7 authors:
1. William Green
2. Jackson Burns
3. Akshat Shirish Zalte
4. Charlles Abreu
5. Jochen Sieg
6. Christian Feldmann
7. Miriam Mathea
This article has no evaluations. Latest version: Mar 16, 2026

A Multimodal Semi-Supervised Learning Framework for Pharmaceutical Cocrystals Prediction

This article has 3 authors:
1. Sohrab Rohani
2. Mohammad Ghanavati
3. Seyed Mohamad Moosavi
This article has no evaluations. Latest version: Mar 30, 2026

ArcMol Enables Task-Adaptive Spherical Representation Learning for Molecular Property Prediction

This article has 7 authors:
1. Lijuan Chen
2. yurong zou
3. Zhongning Guo
4. Zihan zou
5. Duanyang Qin
6. Dingguo Xu
7. Taijin Wang
This article has no evaluations. Latest version: Apr 9, 2026
