Enhancing Chemical Toxicity Predictions with Synthetic SMILES from a Fine-Tuned LLM-Based Chemical Synthesis Generative Model

Abstract

The adoption of transformer-based models has significantly advanced toxicity prediction, yet these models continue to struggle with the class imbalance inherent in benchmark datasets such as Tox21, ClinTox, HIV, and BBBP. This persistent challenge undermines their effectiveness, particularly for minority-class predictions, where data are scarce. Recent large language models (LLMs) have demonstrated a remarkable ability to generate synthetic Simplified Molecular Input Line Entry System (SMILES) strings, offering a novel way to address these imbalances. In this study, we explore the potential of LLM-generated synthetic SMILES to augment training datasets, focusing on the minority classes. Our comprehensive experiments on multiple benchmark datasets show that this strategy not only mitigates the class imbalance but also substantially improves minority-class prediction accuracy without compromising overall model performance. For instance, on the Tox21 dataset, minority-class prediction accuracy increased from 0.707 to 0.965. Similar improvements on the other datasets further support the efficacy of synthetic SMILES augmentation for both toxicity prediction and broader chemical property assessment.
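To make the augmentation idea concrete, the following is a minimal sketch of how synthetic minority-class SMILES might be merged into a training set. It is an illustration under stated assumptions, not the authors' pipeline: the function name `augment_minority`, the `target_ratio` parameter, the toy molecules, and their placeholder labels are all hypothetical, and the LLM generator itself is not shown (its output is simply passed in as a list of strings).

```python
import math
from collections import Counter


def augment_minority(records, synthetic_smiles, minority_label=1, target_ratio=1.0):
    """Append synthetic minority-class SMILES to a labelled training set.

    records          -- list of (smiles, label) pairs
    synthetic_smiles -- candidate minority-class SMILES (assumed to come
                        from a generative model; hypothetical input)
    target_ratio     -- desired minority/majority ratio after augmentation
    """
    counts = Counter(label for _, label in records)
    n_minority = counts[minority_label]
    n_majority = len(records) - n_minority

    # Number of synthetic examples required to reach the target ratio.
    needed = max(0, math.ceil(target_ratio * n_majority) - n_minority)

    # Exact-string deduplication only; no chemical canonicalization is done.
    seen = {smiles for smiles, _ in records}
    augmented = list(records)
    for smiles in synthetic_smiles:
        if needed == 0:
            break
        if smiles in seen:
            continue
        seen.add(smiles)
        augmented.append((smiles, minority_label))
        needed -= 1
    return augmented


# Toy data: the labels are placeholders, not real toxicity annotations.
train = [("CCO", 0), ("CCC", 0), ("CCN", 0), ("c1ccccc1", 0), ("C(=O)Cl", 1)]
synthetic = ["ClCCl", "C(=O)Cl", "BrCBr", "ClCCl", "ICI", "FC(F)Cl"]

balanced = augment_minority(train, synthetic)
print(Counter(label for _, label in balanced))
```

Because the sketch compares raw strings, two different SMILES spellings of the same molecule would both be kept; in practice a cheminformatics toolkit would typically be used to validate and canonicalize generated strings before they are added.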
