Guidance for high-quality functional gene embeddings from large language models

Rongyao Huang
Yaopan Hou
Wuye Zhao
Junbing Zhang
Jian Lu
Yimeng Kong
Peng Xu

Abstract

Large language models (LLMs) are increasingly used to generate gene embeddings, yet systematic benchmarks of prompting strategies and practical guidance for obtaining biologically meaningful representations remain limited. Here we present GEbench, an evaluation framework for assessing LLM-derived gene embeddings across different tasks, prompting strategies, and LLM architectures. GEbench revealed that embedding quality depends primarily on whether the input text contains explicit functional information, rather than on sparse gene identifiers or model size. Identifier-based embeddings showed weak biological organization, whereas embeddings derived from functional descriptions consistently achieved stronger functional separation and predictive performance. Notably, Self-Des, which extracts embeddings from model-generated gene function descriptions, enabled locally deployable LLMs to generate high-fidelity representations that approach the quality of expert-curated databases. Genome-scale analyses further supported these findings, indicating that explicit functional descriptions are an effective design principle for generating high-quality gene embeddings from LLMs.
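The contrast the abstract draws between identifier-based and description-based inputs can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from GEbench: a bag-of-words token count stands in for an LLM embedding, the four genes and their two group labels are a hand-picked example, and `functional_separation` (mean within-group cosine similarity minus mean between-group cosine similarity) is a hypothetical metric, not the one the authors report.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Toy stand-in for an LLM embedding: a sparse bag-of-words count vector.
# A real pipeline would call an embedding model here instead.
def embed_text(text):
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical metric: mean within-group similarity minus mean
# between-group similarity. Higher means clearer functional grouping.
def functional_separation(embeddings, labels):
    within, between = [], []
    for g1, g2 in combinations(sorted(embeddings), 2):
        sim = cosine(embeddings[g1], embeddings[g2])
        (within if labels[g1] == labels[g2] else between).append(sim)
    return sum(within) / len(within) - sum(between) / len(between)

# Illustrative inputs only: short functional descriptions and coarse labels.
descriptions = {
    "TP53": "tumor suppressor transcription factor that regulates "
            "cell cycle arrest and apoptosis",
    "BRCA1": "tumor suppressor involved in DNA double-strand break "
             "repair by homologous recombination",
    "INS": "peptide hormone that lowers blood glucose",
    "GCG": "peptide hormone that raises blood glucose",
}
labels = {
    "TP53": "tumor_suppressor",
    "BRCA1": "tumor_suppressor",
    "INS": "glucose_hormone",
    "GCG": "glucose_hormone",
}

# Identifier-only input: the gene symbol is the entire text.
id_embeddings = {gene: embed_text(gene) for gene in descriptions}
# Description-based input: the functional text is embedded instead.
desc_embeddings = {gene: embed_text(text) for gene, text in descriptions.items()}

id_sep = functional_separation(id_embeddings, labels)
desc_sep = functional_separation(desc_embeddings, labels)

print(f"identifier separation:  {id_sep:.3f}")
print(f"description separation: {desc_sep:.3f}")
```

In this toy setting the gene symbols share no tokens, so the identifier-based vectors carry no group structure, while the descriptions overlap within each group. A real LLM embedding of a bare identifier is not literally orthogonal to others, so the sketch only shows the shape of the comparison, not the size of the effect.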

Version published to 10.64898/2026.04.30.721875 on bioRxiv
May 4, 2026

Related articles

GenePT Revisited: Do Better Text Embeddings Make Better Gene Embeddings?

This article has 3 authors:
1. Jonathan G. Hedley
2. Philip H. S. Torr
3. Kaspar Märtens
This article has no evaluations. Latest version Apr 20, 2026

Improving Variant Effect Prediction by Steering Sparse Mechanistic Features in Protein Language Models

This article has 5 authors:
1. Mingqing Wang
2. Meng Yuan
3. Athanasios V. Vasilakos
4. Yonghong He
5. Zhixiang Ren
This article has no evaluations. Latest version May 15, 2026

Benchmarking long-context genome language models on biosynthetic gene clusters

This article has 4 authors:
1. Keisuke Hirota
2. Koichi Higashi
3. Ken Kurokawa
4. Takuji Yamada
This article has no evaluations. Latest version May 15, 2026
