Transfer learning using generative pretrained genomic DNA models for predicting perturbation-induced changes in gene expression
Abstract
Background
Transfer learning applied to genomic DNA models has the potential to improve predictive capabilities, especially when target-domain datasets and computational resources are limited. Despite its promise, the practical effectiveness of transfer learning in genomic DNA models, particularly for predicting gene expression changes due to perturbations, has not been thoroughly investigated. This study aimed to systematically evaluate the performance and utility of transfer learning approaches that use genomic DNA models to predict perturbation-induced gene expression.

Results
We benchmarked three genomic DNA models across 12 distinct datasets containing perturbation-induced gene expression data to identify optimal conditions for effective transfer learning. Notably, perturbation-induced gene expression data were not included in the pre-training of these genomic DNA models. Among the three models, Enformer consistently generated accurate embeddings, with superior clustering performance and gene signature scores that aligned closely with observed experimental data. Additionally, we identified a phenomenon termed "genomic neighbouring gene interference," wherein partially overlapping DNA sequences of adjacent genes cause correlated predictions, resulting in both beneficial and detrimental effects on predictive accuracy.

Conclusions
Our findings highlight the efficacy of transfer learning in genomic DNA models for predicting perturbation-induced gene expression, particularly emphasizing the Enformer model's robust performance. Understanding genomic neighbouring gene interference offers critical insights for refining predictive accuracy in genomic applications. This study provides practical guidance for researchers developing transfer learning strategies and genomic DNA models, paving the way for more accurate and resource-efficient genomic predictions.