Boosting Pre-trained Model with Silica Nanoparticles Cellular Toxicity Prediction
Abstract
Silica nanoparticles have been widely adopted as carriers for drug delivery and as components of multifunctional nanocomposites, but they can lead to off-target accumulation and subsequent cytotoxic effects. Previous works have explored data-driven methods to improve evaluation efficiency and support the rational design of nanomedicines. However, two challenges remain. The first is data leakage: previous methods either incorporate evaluation-stage features (e.g., Viability_indicator, Positive_control) or rely on one-hot encoding, which requires prior knowledge of all categorical values. The second is poor model generalizability: one-hot encoding fixes the dimensions of the feature space, so the model fails when it encounters unseen categorical values. In this work, we propose a framework based on a pre-trained model for predicting the cellular toxicity of silica nanoparticles. To address data leakage, we first remove features that come from the drug evaluation stage, such as Viability_indicator, Positive_control, SiO 2 NP_label, Interference_testing, and Assay_viability. We then use the embedding layer of TabPFN to map the original categorical values to dense vectors. To improve generalizability, we employ in-context learning with the pre-trained TabPFN, which has already learned a large number of patterns from a large amount of synthetic data. The model only needs to adapt its output prediction distribution through in-context learning, without retraining or adjusting any model parameters. Experimental results on a publicly available dataset demonstrate that our framework not only achieves state-of-the-art classification performance but also effectively mitigates data leakage and improves generalizability to novel nanoparticle formulations. The code and data are available at https://github.com/AppleMax1992/pretrained_nanosilica.
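The two problems named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code. The leaky column names are copied from the abstract; the toy records, the `Cell_type` and `Size_nm` columns, and the helper functions are hypothetical and exist only for illustration. The sketch drops evaluation-stage columns and then shows how a one-hot vocabulary fixed at training time cannot encode an unseen category.

```python
# Evaluation-stage columns named in the abstract; they are known only after an assay is run.
LEAKY_FEATURES = {
    "Viability_indicator",
    "Positive_control",
    "SiO 2 NP_label",
    "Interference_testing",
    "Assay_viability",
}


def drop_leaky_features(record):
    """Return a copy of the record without evaluation-stage columns."""
    return {k: v for k, v in record.items() if k not in LEAKY_FEATURES}


def fit_one_hot(values):
    """Build a fixed vocabulary from the categories seen at training time."""
    return {v: i for i, v in enumerate(sorted(set(values)))}


def one_hot(value, vocab):
    """Encode one value; raises KeyError for a category absent from the vocabulary."""
    vec = [0] * len(vocab)
    vec[vocab[value]] = 1
    return vec


# Hypothetical toy records (assumption: not taken from the paper's dataset).
train = [
    {"Cell_type": "A549", "Size_nm": 20, "Assay_viability": "MTT"},
    {"Cell_type": "HeLa", "Size_nm": 50, "Assay_viability": "CCK-8"},
]

clean = [drop_leaky_features(r) for r in train]
vocab = fit_one_hot(r["Cell_type"] for r in clean)

# A cell line that never appeared in training has no slot in the fixed vocabulary.
try:
    one_hot("HepG2", vocab)
    unseen_ok = True
except KeyError:
    unseen_ok = False
```

Here `unseen_ok` ends up `False`: the encoder has no dimension for the new category, which is the failure mode the abstract attributes to one-hot encoding. The framework described above avoids this step by passing categorical values through TabPFN's embedding layer instead.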