Systematic Evaluation of Molecular Descriptors for Machine Learning–Based IC₅₀ Prediction
Abstract
Accurate prediction of molecular bioactivity is a critical challenge in early-stage drug discovery, as it enables efficient prioritization of compounds within vast chemical space. Among bioactivity measures, the half-maximal inhibitory concentration (IC₅₀) is widely used to quantify compound potency against specific targets. Machine learning (ML) methods provide powerful tools for modeling IC₅₀ values, but their performance depends strongly on the choice of molecular descriptors. In this study, we systematically compare four descriptor classes (physicochemical properties, MACCS structural keys, Morgan circular fingerprints, and Mordred-generated descriptors) for their ability to predict IC₅₀ values against the SSTR2 receptor. Curated and preprocessed datasets were used to train ML models, including ensemble stacking frameworks, to assess descriptor complementarity and robustness. Our results show that MACCS keys consistently outperform the other descriptors, achieving R² values close to 0.9, reflecting their ability to capture pharmacophore-relevant structural motifs through predefined SMARTS patterns. To complement predictive benchmarking, SHAP (SHapley Additive exPlanations) analysis was applied to quantify feature contributions, linking statistical importance to chemically interpretable patterns. These results demonstrate the practical utility of substructure-focused fingerprints in ML-driven IC₅₀ prediction and provide guidance for descriptor selection strategies that enhance accuracy, interpretability, and generalizability in computational drug discovery.

Scientific Contribution: This study presents a systematic evaluation of four molecular descriptor classes (physicochemical properties, MACCS structural keys, Morgan circular fingerprints, and Mordred descriptors) for their ability to predict IC₅₀ values against the SSTR2 receptor using machine learning models and ensemble frameworks.
The results demonstrate that MACCS keys consistently outperform more complex descriptor families, achieving R² values close to 0.9, owing to their SMARTS-based encoding of pharmacophore-relevant structural features. Beyond predictive benchmarking, we employed SHAP (SHapley Additive exPlanations) analysis to link statistical feature importance with chemically interpretable patterns, thereby validating model robustness and providing mechanistic insights into descriptor performance. Collectively, these contributions highlight the practical utility of substructure-focused fingerprints in cheminformatics workflows and provide guidance for selecting interpretable, high-performing descriptors to enhance accuracy, generalizability, and interpretability in computational drug discovery.
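
The R² values quoted above are not defined in this summary. As a minimal sketch, assuming they refer to the standard coefficient of determination computed on observed versus predicted activity values, the metric can be written as follows. The function name `r_squared` and the numbers are hypothetical and chosen only for illustration; they are not data or code from the study.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after the model's predictions.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of a baseline that always predicts the mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all observed values are equal")
    return 1.0 - ss_res / ss_tot


# Hypothetical activity values for illustration only (not from the study).
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 6.0, 6.5, 8.0]
score = r_squared(observed, predicted)
```

Under this definition, a score near 0.9 means the model's squared error is roughly a tenth of that of a baseline that always predicts the mean observed value. A score of 0 corresponds to that baseline itself.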