A Multimodal Semi-Supervised Learning Framework for Pharmaceutical Cocrystals Prediction

Abstract

Cocrystal formation is a widely used strategy in solid-state chemistry and pharmaceutical development to improve the solubility, stability, and bioavailability of molecules with otherwise poor physicochemical properties. Identifying viable coformer combinations remains laborious and uncertain. A key but underappreciated challenge is that experimental databases overwhelmingly report successful cocrystals, while unsuccessful attempts are rarely documented, creating biased datasets that cause many machine-learning models to make overly optimistic and unreliable predictions when applied to new chemical systems. Here, we address this limitation by reframing cocrystal prediction as a learning problem with missing negative information and by adopting a conservative strategy that focuses on identifying molecular pairs that are very unlikely to form cocrystals. We leverage multiple, independent molecular descriptions—including structural, electronic, and physicochemical characteristics—that provide complementary views for identifying reliable negatives, and use their agreement to exclude implausible combinations from large sets of untested pairs. These highly confident pseudo-negative examples are then used to mitigate data imbalance and to fine-tune a pretrained graph attention network for cocrystal prediction. Across large and chemically diverse datasets, this data-centric strategy significantly improves the reliability and generalization of cocrystal prediction models compared with existing deep-learning approaches, demonstrating that carefully correcting for missing negative information is critical for making computational screening more realistic and more useful for guiding future experimental discovery.
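
The agreement step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_pseudo_negatives`, the per-view scores, and the threshold value are all hypothetical, and the sketch assumes each molecular view has already produced a score in [0, 1] for how likely an untested pair is to form a cocrystal. A pair is kept as a pseudo-negative only when every view independently scores it below the threshold.

```python
from typing import Dict, List, Sequence


def select_pseudo_negatives(
    view_scores: Dict[str, Sequence[float]],
    threshold: float = 0.1,
) -> List[int]:
    """Return indices of untested pairs that every view scores below `threshold`.

    `view_scores` maps a view name to one score per untested pair.
    The names and the threshold are illustrative assumptions.
    """
    if not view_scores:
        raise ValueError("at least one view is required")
    lengths = {len(scores) for scores in view_scores.values()}
    if len(lengths) != 1:
        raise ValueError("all views must score the same set of pairs")
    n_pairs = lengths.pop()
    # Conservative rule: a single dissenting view is enough to reject a pair.
    return [
        i
        for i in range(n_pairs)
        if all(scores[i] < threshold for scores in view_scores.values())
    ]


# Hypothetical scores for four untested pairs under three views.
scores = {
    "structural": [0.02, 0.40, 0.05, 0.08],
    "electronic": [0.04, 0.03, 0.30, 0.09],
    "physicochemical": [0.01, 0.05, 0.02, 0.07],
}
pseudo_negatives = select_pseudo_negatives(scores, threshold=0.1)
print(pseudo_negatives)  # [0, 3]
```

Requiring unanimous agreement trades recall for precision: fewer pairs are labeled, but those that are labeled are less likely to be cocrystals that simply have not been tested yet. The selected pairs would then serve as negative examples alongside the reported positives when fine-tuning the classifier.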
