A Task-Specific Transfer Learning Approach to Enhancing Small Molecule Retention Time Prediction with Limited Data
Abstract
Liquid chromatography (LC) is an essential technique for separating and identifying compounds in complex mixtures across various scientific fields. In LC, retention time (RT) is a crucial property for identifying small molecules, and its prediction has been extensively researched over recent decades. The wide array of columns and experimental conditions necessary for effectively separating diverse compounds presents a challenge. Consequently, advanced deep learning for retention time prediction in real-world scenarios is often hampered by limited training data spanning these varied experimental setups. While transfer learning (TL) can leverage knowledge from upstream datasets, it may not always provide an optimal initial point for specific downstream tasks. We consider six challenging benchmark datasets from different LC systems and experimental conditions (100-300 compounds each) where TL from RT datasets acquired under standard conditions fails to achieve satisfactory accuracy (R² ≥ 0.8), highlighting the need for more sophisticated TL strategies that can effectively adapt to the unique characteristics of target chromatographic systems under specific experimental conditions. We present a task-specific transfer learning (TSTL) strategy that pre-trains multiple models on distinct large-scale datasets, optimizes each for fine-tuned performance on the specific target task, and then integrates them into a single model. Evaluated on five deep neural network architectures across these six datasets through 5-fold cross-validation, TSTL demonstrated significant performance improvements, with the average R² increasing from 0.587 to 0.825. Furthermore, TSTL consistently outperformed conventional TL across various training dataset sizes, demonstrating superior data efficiency for RT prediction under diverse experimental conditions with limited training data.
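The pre-train / fine-tune / integrate workflow described above can be sketched in miniature. This is a hypothetical illustration only: the paper's actual network architectures, RT datasets, and model-integration step are not given in this abstract, so the sketch substitutes synthetic descriptor data, small scikit-learn MLPs, and simple prediction averaging as stand-ins for the integration step.

```python
# Hedged sketch of task-specific transfer learning (TSTL) for RT prediction.
# Synthetic data, sklearn MLPs, and prediction averaging are assumptions;
# they are not the paper's actual models or integration method.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)

def make_rt_data(n, shift, scale, n_feat=16):
    """Synthetic molecular descriptors -> retention times (stand-in data)."""
    X = rng.normal(size=(n, n_feat))
    w = np.linspace(-1.0, 1.0, n_feat)  # fixed weights shared across systems
    y = scale * (X @ w) + shift + rng.normal(scale=0.1, size=n)
    return X, y

# Step 1: pre-train one model per large upstream RT dataset
# (each upstream "system" differs by a shift/scale, mimicking
# different LC columns and experimental conditions).
upstream = [make_rt_data(2000, shift=s, scale=sc)
            for s, sc in [(0.0, 1.0), (2.0, 0.8), (-1.0, 1.2)]]
models = []
for X, y in upstream:
    m = MLPRegressor(hidden_layer_sizes=(32,), max_iter=300,
                     warm_start=True, random_state=0)
    m.fit(X, y)
    models.append(m)

# Step 2: fine-tune each pre-trained model on the small target dataset
# (the benchmarks above have only 100-300 compounds).
X_tgt, y_tgt = make_rt_data(200, shift=1.0, scale=0.9)
X_tr, y_tr = X_tgt[:150], y_tgt[:150]
X_te, y_te = X_tgt[150:], y_tgt[150:]
for m in models:
    m.max_iter = 100
    m.fit(X_tr, y_tr)  # warm_start=True resumes from pre-trained weights

# Step 3: integrate the fine-tuned models into one predictor;
# averaging is used here purely for illustration.
pred = np.mean([m.predict(X_te) for m in models], axis=0)
print(f"ensemble R^2 on held-out target data: {r2_score(y_te, pred):.3f}")
```

The key point the sketch mirrors is that each pre-trained model is adapted to the target task before integration, rather than transferring from a single upstream dataset as in conventional TL.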