GenePT Revisited: Do Better Text Embeddings Make Better Gene Embeddings?

Jonathan G. Hedley
Philip H. S. Torr
Kaspar Märtens

Abstract

GenePT introduced a simple recipe for gene representations: embed each gene’s natural-language description with a general-purpose text embedding model and reuse the resulting vectors across downstream tasks. Since GenePT’s release, embedding models have improved rapidly, with many strong open and commercial encoders benchmarked on suites such as the Massive Text Embedding Benchmark (MTEB). We present a controlled “leaderboard” study that keeps the GenePT pipeline fixed and varies only the embedding backbone. We benchmark contemporary encoders on four diverse gene embedding tasks: gene–gene interaction prediction, gene property classification, cell type classification, and prediction of transcriptomic responses to unseen genetic perturbations. Across these settings, newer backbones consistently outperform the original GenePT backbone (text-embedding-ada-002), achieving improvements of 1–17%, while enabling fully reproducible research by avoiding API dependencies.
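The recipe described above (embed each gene's description once, then reuse the vectors while swapping only the backbone) can be illustrated with a minimal sketch. It is not the authors' code: the hashed bag-of-words "backbones", the function names, and the three short gene descriptions are all hypothetical stand-ins chosen so the example runs without any model download or API call. In practice each `Embedder` would wrap a real text embedding model.

```python
import hashlib
import math
from typing import Callable, Dict, List

# An "embedding backbone" is modelled as any function from text to a vector.
Embedder = Callable[[str], List[float]]


def make_hashed_bow_embedder(dim: int) -> Embedder:
    """Toy stand-in for a text encoder (assumption: NOT a real model).

    Each whitespace token is hashed into one of `dim` buckets and the
    resulting count vector is L2-normalised.
    """
    def embed(text: str) -> List[float]:
        vec = [0.0] * dim
        for token in text.lower().split():
            digest = hashlib.md5(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # guard empty text
        return [v / norm for v in vec]
    return embed


def build_gene_embeddings(descriptions: Dict[str, str],
                          embed: Embedder) -> Dict[str, List[float]]:
    """Fixed pipeline step: one vector per gene, computed from its description."""
    return {gene: embed(text) for gene, text in descriptions.items()}


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Illustrative one-line descriptions (placeholders, not the study's inputs).
descriptions = {
    "TP53": "tumor suppressor transcription factor",
    "BRCA1": "tumor suppressor involved in DNA repair",
    "INS": "hormone that regulates blood glucose",
}

# Only the backbone varies; everything downstream of it stays the same.
backbones: Dict[str, Embedder] = {
    "toy-64": make_hashed_bow_embedder(64),
    "toy-256": make_hashed_bow_embedder(256),
}

embeddings_by_backbone = {
    name: build_gene_embeddings(descriptions, embed)
    for name, embed in backbones.items()
}

for name, table in embeddings_by_backbone.items():
    sim = cosine(table["TP53"], table["BRCA1"])
    print(f"{name}: dim={len(table['TP53'])}, cos(TP53, BRCA1)={sim:.3f}")
```

Because every backbone is hidden behind the same `Embedder` signature, any downstream evaluation can consume `embeddings_by_backbone` unchanged, which is the sense in which the pipeline is held fixed while the encoder is varied.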

Version published to 10.64898/2026.04.16.718976 on bioRxiv
Apr 20, 2026

