Ontology pre-training improves machine learning-based predictions for metabolites
Abstract
Recent advances in machine learning have shown that integrating expert knowledge improves performance, particularly in complex domains such as biology. Bio-ontologies offer a rich source of curated biological knowledge that can be harnessed to this end. Here, we describe an intuitive and generalisable approach that embeds the knowledge contained in a classification hierarchy derived from a bio-ontology into a machine learning model. The embedding takes place in an intermediate training step between general-purpose pre-training and task-specific fine-tuning, a process that we call ‘ontology pre-training’. We show that this approach improves predictive performance and reduces training time for a broad range of prediction tasks relevant to understanding metabolite functions in living systems, using a range of datasets derived from MoleculeNet. We see the biggest improvement for regression tasks, e.g. prediction of lipophilicity and aqueous solubility of molecules, and a robust improvement for most classification tasks. Our approach can be adapted to a wide range of knowledge sources, models and prediction tasks.
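One way to picture how a classification hierarchy could supply training targets for the intermediate step is to expand each class assignment into a multi-label vector that also marks every ancestor class. The sketch below is illustrative only and is not taken from the preprint: the toy hierarchy, the class names, and the helpers `ancestors` and `multi_hot` are all hypothetical, and the abstract does not say how the targets are actually built.

```python
from typing import Dict, List, Set

# Hypothetical toy hierarchy (assumption, not from the preprint):
# each class maps to its direct parent classes via "is_a" links.
IS_A: Dict[str, List[str]] = {
    "fatty acid": ["carboxylic acid", "lipid"],
    "carboxylic acid": ["organic acid"],
    "organic acid": [],
    "lipid": [],
}

# Fixed label order: one position per ontology class.
LABELS: List[str] = sorted(IS_A)


def ancestors(cls: str, is_a: Dict[str, List[str]]) -> Set[str]:
    """Return cls together with every class reachable through is_a links."""
    seen: Set[str] = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(is_a.get(current, []))
    return seen


def multi_hot(cls: str, is_a: Dict[str, List[str]], labels: List[str]) -> List[int]:
    """Encode a class and all of its ancestors as a 0/1 vector over labels."""
    members = ancestors(cls, is_a)
    return [1 if label in members else 0 for label in labels]


if __name__ == "__main__":
    for name in LABELS:
        print(name, multi_hot(name, IS_A, LABELS))
```

Under these assumptions, a model pre-trained on general data could be trained to predict such vectors before being fine-tuned on a downstream task, so that the hierarchy's structure is reflected in the targets it sees.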