Comparing Massively-Multitask Regression Algorithms for Drug Discovery
Abstract
Massively-multitask regression models (MMRMs) have revolutionized activity prediction for drug discovery. MMRMs trained on millions of compounds and many thousands of assays can predict bioactivity with accuracy comparable to 4-concentration IC50 experiments. This report compares six MMRMs: pQSAR, Alchemite, MT-DNN, MetaNN, Macau and IMC. Models were trained by experts in each method on identical sets of 159 kinase and 4276 diverse ChEMBL assays, employing the same, realistically novel, training/test-set splits.

MMRMs performed much better than single-task random forest regression (ST-RFR) models for our use case of imputing full bioactivity profiles for the very sparse compound collection on which the models were trained. Five of the MMRMs train all assay models simultaneously, so they must leave out test-set measurements for every assay to avoid leakage (i.e. 25% of the data). One method trains models one at a time, and trains on all data except the test set for that single assay (< 1% of the data). All algorithms were compared using 75/25 splits and, where possible, 99+/<1% splits. Most methods achieved similar accuracy when tested on the same split. However, all MMRMs performed much worse when evaluated on 75/25 splits than on 99+/<1% splits. Thus, while many methods produce comparably accurate final production models (trained on all the data), methods that require 75/25 splits cannot evaluate the accuracy of those final models.

While outstanding for imputation, MMRMs proved little better than ST-RFR for compounds very unlike the training collection. MMRMs are therefore best suited to hit-finding, off-target, promiscuity, mechanism-of-action (MoA), polypharmacology and drug-repurposing applications within the training collection. Besides accuracy, other pros and cons of each method are discussed.
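The split-size argument above can be illustrated with a small numeric sketch. All numbers here (matrix size, sparsity, seed) are hypothetical, not taken from the study: a simultaneous multitask fit must withhold the test compounds from every assay column at once, whereas a one-assay-at-a-time fit withholds only that single assay's test measurements, leaving far more data available for training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse activity matrix: 1000 hypothetical compounds x 50 hypothetical
# assays, with ~10% of (compound, assay) pairs actually measured.
n_cmpd, n_assay = 1000, 50
measured = rng.random((n_cmpd, n_assay)) < 0.10

# Strategy 1 (simultaneous multitask training): one global 75/25 compound
# split; the 25% test compounds are withheld from EVERY assay to avoid leakage.
test_cmpd = rng.random(n_cmpd) < 0.25
withheld_global = measured[test_cmpd].sum() / measured.sum()

# Strategy 2 (one-assay-at-a-time training): when modelling assay j, hold out
# only assay j's own test measurements, so only a tiny fraction of the whole
# matrix is ever excluded from any single model's training data.
j = 0
test_rows_j = rng.random(n_cmpd) < 0.25
withheld_per_assay = measured[test_rows_j, j].sum() / measured.sum()

print(f"global 75/25 split withholds ~{withheld_global:.0%} of all measurements")
print(f"per-assay split withholds ~{withheld_per_assay:.2%} of all measurements")
```

With these toy settings the global split removes roughly a quarter of all measurements from training, while the per-assay split removes well under 1%, which is why the latter can be evaluated on splits that closely approximate the final production model.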