Benchmarking long-context genome language models on biosynthetic gene clusters
Abstract
Recent advances in language models for natural language processing have spread to the field of genomics, driving the development of genome language models (gLMs) to decipher genomic information. Cutting-edge long-context gLMs are promising approaches for understanding and designing biological complexity, but their evaluation remains underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark focused on biosynthetic gene clusters for assessing long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification and coding sequence annotation. Using BGCs-Bench, we perform systematic and layer-wise evaluations of the embedding representations of long-context gLMs, demonstrating that layer selection is crucial for downstream task performance. In addition to the evaluation results, logit lens analysis of autoregressive gLMs suggests that, in StripedHyena-based models, earlier layers encode biologically meaningful information from input DNA sequences, whereas deeper layers optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.
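To make the idea of a layer-wise embedding evaluation concrete, the following is a minimal sketch and not the authors' pipeline. It assumes that per-layer embeddings have already been extracted as arrays of shape (n_sequences, dim), and it substitutes a simple nearest-centroid probe for whatever classifier the study actually uses. The embeddings are synthetic, and the function names (`layerwise_probe`, `make_synthetic_layers`) and the per-layer signal strengths are hypothetical choices made only for illustration.

```python
import numpy as np


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit class centroids on the training split and score the test split."""
    classes = np.unique(train_y)
    centroids = np.stack([train_x[train_y == c].mean(axis=0) for c in classes])
    # Pairwise Euclidean distances, shape (n_test, n_classes).
    dists = np.linalg.norm(test_x[:, None, :] - centroids[None, :, :], axis=2)
    preds = classes[dists.argmin(axis=1)]
    return float((preds == test_y).mean())


def layerwise_probe(layer_embeddings, labels, train_fraction=0.7, seed=0):
    """Return one held-out accuracy per layer, using the same split for all layers."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(labels.shape[0])
    n_train = int(train_fraction * labels.shape[0])
    train_idx, test_idx = order[:n_train], order[n_train:]
    return [
        nearest_centroid_accuracy(
            emb[train_idx], labels[train_idx], emb[test_idx], labels[test_idx]
        )
        for emb in layer_embeddings
    ]


def make_synthetic_layers(n_samples=300, dim=16, n_classes=3,
                          signal_per_layer=(0.0, 3.0, 0.3), seed=0):
    """Hypothetical stand-in for real gLM embeddings.

    Each layer is class signal scaled by an assumed strength plus unit noise,
    so the middle layer is the most informative by construction.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, n_classes, size=n_samples)
    class_dirs = rng.normal(size=(n_classes, dim))
    layers = [
        signal * class_dirs[labels] + rng.normal(size=(n_samples, dim))
        for signal in signal_per_layer
    ]
    return layers, labels


if __name__ == "__main__":
    layers, labels = make_synthetic_layers()
    for index, score in enumerate(layerwise_probe(layers, labels)):
        print(f"layer {index}: held-out accuracy = {score:.2f}")
```

With real models, the synthetic arrays would be replaced by pooled hidden states taken from each layer. Comparing the resulting per-layer scores is one simple way to see why the choice of layer can matter for a downstream task.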