Limitations and Enhancements in Genomic Language Models: Dynamic Selection Approach

Abstract

Genomic Language Models (GLMs), which learn from nucleotide sequences, are crucial for understanding biological principles and excel in tasks such as sequence generation and classification. However, state-of-the-art models vary in training methods, architectures, and tokenization techniques, resulting in different strengths and weaknesses. We propose a multi-model fusion approach with a dynamic model selector that effectively integrates three models with distinct architectures. This fusion enhances predictive performance in downstream tasks, outperforming any individual model and achieving complementary advantages. Our comprehensive analysis reveals a strong correlation between model performance and motif prominence in sequences. Nevertheless, overreliance on motifs may limit the understanding of ultra-short core genes and the context of ultra-long sequences. Importantly, based on our in-depth experiments and analyses of the current three leading models, we identify unresolved issues and suggest potential future directions for the development of genomic models. The code, data, and pre-trained model are available at https://github.com/Jacob-S-Qiu/glm_dynamic_selection.
