Multi-Modal Ensemble Learning for TLR4 Binding Prediction: Addressing Data Scarcity and Leakage in Small Molecule Drug Discovery


Abstract

Machine learning approaches for drug discovery often suffer from data leakage and inadequate validation, leading to overoptimistic performance estimates that fail to translate to real-world applications. We present a methodologically rigorous multi-modal ensemble learning framework for predicting TLR4 binding affinity that addresses these critical challenges. Our approach integrates 3D molecular structures, 2D chemical descriptors, and experimental binding data while implementing diversity-preserving preprocessing to handle conformational duplicates without losing valuable chemical information. The ensemble model combines Random Forest, ElasticNet, Ridge, and Bayesian Ridge regressors with advanced feature selection and statistical validation. On a curated dataset of 49 unique TLR4 ligands, our method achieves cross-validation R² = 0.74 ± 0.10 with statistical significance confirmed by permutation testing (p < 0.01). Key findings include the dominance of molecular complexity descriptors over traditional drug-like properties for TLR4 binding prediction. This work demonstrates how proper handling of multi-modal molecular data can yield reliable predictive models suitable for drug discovery applications.
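
The abstract names the four regressors and the permutation test but not how they are wired together. The following is a minimal sketch of one plausible arrangement using scikit-learn, not the authors' implementation. It assumes a simple averaging ensemble (`VotingRegressor`), univariate feature selection (`SelectKBest`), and synthetic data standing in for the 49-ligand descriptor table; the hyperparameters, the number of selected features, and the helper name `build_ensemble` are all illustrative assumptions. Scaling and feature selection sit inside the pipeline so that they are refit on each training fold, which is one common way to avoid the leakage the abstract warns about.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor, VotingRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import BayesianRidge, ElasticNet, Ridge
from sklearn.model_selection import KFold, permutation_test_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def build_ensemble(k_features=10, seed=0):
    """Hypothetical helper: averaging ensemble of the four regressor types.

    All hyperparameters here are assumptions chosen for illustration.
    """
    voter = VotingRegressor([
        ("rf", RandomForestRegressor(n_estimators=50, random_state=seed)),
        ("enet", ElasticNet(alpha=0.1, l1_ratio=0.5, max_iter=10000)),
        ("ridge", Ridge(alpha=1.0)),
        ("bayes", BayesianRidge()),
    ])
    # Scaling and feature selection are pipeline steps, so cross-validation
    # refits them on the training portion of every fold (no leakage from
    # the held-out fold into preprocessing).
    return make_pipeline(
        StandardScaler(),
        SelectKBest(f_regression, k=k_features),
        voter,
    )


# Synthetic stand-in for the descriptor matrix: 49 samples, 30 features.
X, y = make_regression(
    n_samples=49, n_features=30, n_informative=5, noise=10.0, random_state=0
)

cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Permutation test: refit on label-shuffled targets and compare the
# cross-validated R^2 on the true labels against that null distribution.
# n_permutations is kept small here only to keep the sketch fast.
score, perm_scores, p_value = permutation_test_score(
    build_ensemble(), X, y,
    cv=cv, scoring="r2", n_permutations=30, random_state=0,
)

print(f"cross-validated R^2: {score:.2f}")
print(f"mean R^2 under permuted labels: {perm_scores.mean():.2f}")
print(f"permutation p-value: {p_value:.3f}")
```

With `n_permutations=30` the smallest attainable p-value is 1/31, so resolving a threshold such as p < 0.01 would need at least 100 permutations; the small value above is a runtime convenience only.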
