Pretraining Improves Prediction of Genomic Datasets Across Species
Abstract
Recent studies suggest that deep neural network models trained on thousands of human genomic datasets can accurately predict genomic features, including gene expression and chromatin accessibility. However, training these models is computation- and time-intensive, and datasets of comparable size do not exist for most other organisms. Here, we identify modifications to an existing state-of-the-art model that improve model accuracy while reducing training time and computational cost. Using this streamlined model architecture, we investigate the ability of models pretrained on human genomic datasets to transfer performance to a variety of tasks. Models pretrained on human data but fine-tuned on genomic datasets from diverse tissues and species achieved significantly higher prediction accuracy while significantly reducing training time compared to models trained from scratch, with Pearson correlation coefficients between experimental results and predictions as high as 0.8. Further, we found that including excessive training tasks decreased model performance and that this compromised performance could be partially but not completely rescued by fine-tuning. Thus, simplifying model architecture, applying pretrained models, and carefully considering the number of training tasks may be effective and economical techniques for building new models across data types, tissues, and species.
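The transfer-learning strategy described above can be illustrated with a deliberately simplified toy model. The sketch below is illustrative only and does not reproduce the paper's architecture: a shared linear "encoder" is pretrained on a large source dataset (standing in for the human data), then frozen while only a small output head is fit on a much smaller target dataset (standing in for a new tissue or species). All function and variable names here are hypothetical, and performance is scored with the Pearson correlation coefficient, the metric quoted in the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_ridge(X, y, lam=1e-2):
    # Closed-form ridge regression: (X^T X + lam*I)^{-1} X^T y
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# "Pretraining": a large source dataset (analogous to abundant human data)
# determines the shared weights w_pre.
X_src = rng.normal(size=(2000, 50))
w_true = rng.normal(size=50)                  # hypothetical ground truth
y_src = X_src @ w_true + 0.1 * rng.normal(size=2000)
w_pre = fit_ridge(X_src, y_src)

# "Fine-tuning": a small target dataset (analogous to a data-poor species)
# whose labels are a rescaled version of the source signal. Only a scalar
# output head `a` is fit on top of the frozen pretrained predictor.
X_tgt = rng.normal(size=(30, 50))
y_tgt = 2.0 * (X_tgt @ w_true) + 0.1 * rng.normal(size=30)
z = X_tgt @ X_src.T @ np.zeros(2000) if False else X_tgt @ w_pre  # frozen features
a = (z @ y_tgt) / (z @ z)                     # 1-D least-squares head

# Baseline: train from scratch on the small target set alone.
w_scratch = fit_ridge(X_tgt, y_tgt)

# Evaluate both models by Pearson correlation on held-out target data.
X_test = rng.normal(size=(500, 50))
y_test = 2.0 * (X_test @ w_true)
r_pre = np.corrcoef(a * (X_test @ w_pre), y_test)[0, 1]
r_scr = np.corrcoef(X_test @ w_scratch, y_test)[0, 1]
print(f"pretrained+fine-tuned r={r_pre:.3f}, from-scratch r={r_scr:.3f}")
```

With only 30 target examples against 50 features, the from-scratch model overfits, while the pretrained encoder transfers the shared structure and the fine-tuned head recovers the target-specific rescaling, mirroring the abstract's finding that fine-tuning pretrained models outperforms training from scratch on small datasets.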