Genomic Foundationless Models: Pretraining Does Not Promise Performance
Abstract
The success of Large Language Models has inspired the development of Genomic Foundation Models (GFMs) through similar pretraining techniques. However, the relationship between pretraining performance and effectiveness on downstream genomic tasks remains unclear. Additionally, the high computational cost of pretraining raises questions about its cost-efficiency. To assess the usefulness of pretraining in genomics, we evaluated seven different GFMs across various benchmarks, comparing them to their counterparts with randomly initialized weights. Surprisingly, we found that randomly initialized models can match or even surpass the performance of pretrained GFMs in finetuning and feature-extraction tasks. We also discovered that pretrained GFMs fail to capture clinically relevant genetic mutations, which are crucial for understanding genetic disorders and phenotypic traits. Our results indicate that most current pretrained GFMs lack a “foundational” understanding of genomics and provide minimal utility, even for basic tasks such as sequence classification. These findings collectively highlight the need to critically rethink pretraining approaches for genomics. Our code is available at https://github.com/m42-health/gfm-random-eval.
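As a rough illustration of the comparison described above, the sketch below contrasts a pretrained GFM with a randomly initialized counterpart of the same architecture using the Hugging Face `transformers` library. This is a minimal sketch, not the paper's exact setup: the checkpoint name and the two-label classification head are illustrative assumptions, and the actual evaluation protocol (benchmarks, hyperparameters, feature extraction) is described in the repository linked above.

```python
# Minimal sketch: pretrained vs. randomly initialized GFM of the same architecture.
# Assumptions: the checkpoint name is illustrative, and a binary sequence
# classification head stands in for the downstream tasks.
from transformers import AutoConfig, AutoModelForSequenceClassification

checkpoint = "InstaDeepAI/nucleotide-transformer-500m-human-ref"  # hypothetical choice

# Pretrained variant: weights loaded from the published checkpoint.
pretrained_model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=2
)

# Randomly initialized counterpart: identical architecture (same config),
# but weights drawn from the default initializer rather than the checkpoint.
config = AutoConfig.from_pretrained(checkpoint, num_labels=2)
random_init_model = AutoModelForSequenceClassification.from_config(config)

# Both models would then be finetuned (or used as frozen feature extractors)
# under identical settings, so any performance gap isolates the effect of pretraining.
```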