Benchmarking long-context genome language models on biosynthetic gene clusters

Keisuke Hirota
Koichi Higashi
Ken Kurokawa
Takuji Yamada

Abstract

Recent advances in language models for natural language processing have spread to genomics, driving the development of genome language models (gLMs) that aim to decipher genomic information. Cutting-edge long-context gLMs are a promising approach to understanding and designing complex biological systems, but methods for evaluating them remain underdeveloped. In this study, we introduce BGCs-Bench, a unified benchmark built on biosynthetic gene clusters (BGCs) that assesses long-range genomic modeling on three downstream tasks: biosynthetic class prediction, taxonomic classification, and coding sequence annotation. Using BGCs-Bench, we perform systematic, layer-wise evaluations of the embedding representations of long-context gLMs and demonstrate that layer selection is crucial for downstream task performance. Beyond these evaluation results, a logit lens analysis of autoregressive gLMs suggests that in StripedHyena-based models the earlier layers encode biologically meaningful information from input DNA sequences, while the deeper layers optimize embeddings for sequence generation. These findings provide insights for more effective development and application of long-context gLMs.
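
The layer-wise evaluation described in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the nearest-centroid probe, the mean pooling, the synthetic embeddings, and every function name below are assumptions chosen to keep the example self-contained. In practice the per-layer token embeddings would come from a gLM's hidden states rather than from a random generator.

```python
import random
from statistics import fmean


def mean_pool(token_embeddings):
    """Average per-token vectors into one sequence-level vector."""
    dim = len(token_embeddings[0])
    return [fmean(tok[d] for tok in token_embeddings) for d in range(dim)]


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [fmean(v[d] for v in vectors) for d in range(dim)]


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid_accuracy(train_x, train_y, test_x, test_y):
    """Fit a nearest-centroid probe on the training split; return test accuracy."""
    classes = sorted(set(train_y))
    cents = {
        c: centroid([x for x, y in zip(train_x, train_y) if y == c])
        for c in classes
    }
    correct = sum(
        min(classes, key=lambda c: sq_dist(x, cents[c])) == y
        for x, y in zip(test_x, test_y)
    )
    return correct / len(test_y)


def layerwise_probe(layer_embeddings, labels, n_train):
    """Score each layer separately.

    layer_embeddings[layer][i] holds the token vectors of sequence i at that layer.
    """
    scores = {}
    for layer, seqs in layer_embeddings.items():
        pooled = [mean_pool(s) for s in seqs]
        scores[layer] = nearest_centroid_accuracy(
            pooled[:n_train], labels[:n_train], pooled[n_train:], labels[n_train:]
        )
    return scores


def make_synthetic_layer(rng, labels, signal, n_tokens=8, dim=4):
    """Hypothetical stand-in for real hidden states: class-dependent shift plus noise."""
    seqs = []
    for y in labels:
        shift = signal if y == 1 else -signal
        seqs.append(
            [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n_tokens)]
        )
    return seqs


rng = random.Random(0)
labels = [i % 2 for i in range(40)]  # two interleaved toy classes
layer_embeddings = {
    0: make_synthetic_layer(rng, labels, signal=0.0),  # no class information
    1: make_synthetic_layer(rng, labels, signal=5.0),  # strong class information
    2: make_synthetic_layer(rng, labels, signal=0.2),  # weak class information
}
scores = layerwise_probe(layer_embeddings, labels, n_train=20)
for layer, acc in scores.items():
    print(f"layer {layer}: accuracy {acc:.2f}")
```

Running the same probe on every layer and comparing the scores is what makes the choice of layer visible; the probe itself could be swapped for any lightweight classifier.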

Version published to 10.64898/2026.05.12.724296 on bioRxiv
May 15, 2026

Related articles

LAMBDA: A Prophage Detection Benchmark for Genomic Language Models

This article has 12 authors:
1. LeAnn M. Lindsey
2. Nicole L. Pershing
3. Keith Dufault-Thompson
4. Ho-jin Gwak
5. Anisa Habib
6. Aaron Schindler
7. Arjun Rakheja
8. June Round
9. W. Zac Stephens
10. Anne J. Blaschke
11. Hari Sundar
12. Xiaofang Jiang
This article has no evaluations. Latest version: Mar 26, 2026

Deep-Plant: a supervised foundation model for plant regulatory genomics

This article has 10 authors:
1. Ahmed Daoud
2. Soumyadip Roy
3. Haoxuan Zeng
4. Xinyu Bao
5. Zhenhao Zhang
6. Jiakang Wang
7. Paul Parodi
8. Anireddy SN Reddy
9. Jie Liu
10. Asa Ben-Hur
This article has no evaluations. Latest version: Apr 9, 2026

GenNA: Conditional generation of nucleotide sequences guided by natural-language annotations

This article has 6 authors:
1. Yi Shen
2. Guangshuo Cao
3. Jianghong Wu
4. Dijun Chen
5. Cong Feng
6. Ming Chen
This article has no evaluations. Latest version: Apr 24, 2026