Benchmarking gene embeddings from sequence, expression, network, and text models for functional prediction tasks

Jeffrey Zhong
Lechuan Li
Ruth Dannenfelser
Vicky Yao

Abstract

Accurate, data-driven representations of genes are critical for interpreting high-throughput biological data, yet no consensus exists on the most effective embedding strategy for common functional prediction tasks. Here, we present a systematic comparison of 38 gene embedding methods derived from amino acid sequences, gene expression profiles, protein–protein interaction networks, and biomedical literature. We benchmark each approach, while attempting to control for data leakage, across three classes of tasks: predicting individual gene attributes, characterizing paired gene interactions, and assessing gene set relationships. Overall, we find that literature-based embeddings deliver superior performance across prediction tasks, sequence-based models excel at genetic interaction prediction, and expression-derived representations are well suited to disease-related associations. Interestingly, network embeddings achieve performance similar to that of literature-based embeddings on most tasks despite using significantly smaller training sets. The type of training data has a greater influence on performance than the specific embedding construction method, and embedding dimensionality has only minimal impact. Our benchmarks clarify the strengths and limitations of current gene embeddings, providing practical guidance for selecting representations for downstream biological applications.
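The abstract does not say how data leakage is controlled. As one illustration of the concern for paired-gene tasks, the sketch below splits gene pairs so that no gene appears in both the training and test sets. This is an assumption about one common approach, not the authors' procedure; the function name `gene_disjoint_split`, its parameters, and the placeholder gene IDs are all hypothetical.

```python
import random


def gene_disjoint_split(pairs, test_fraction=0.2, seed=0):
    """Split gene pairs so no gene is shared between train and test.

    Hypothetical helper for illustration only; not taken from the paper.
    Returns (train_pairs, test_pairs, dropped_pairs).
    """
    # Collect every gene that appears in any pair, in a stable order.
    genes = sorted({gene for pair in pairs for gene in pair})
    rng = random.Random(seed)
    rng.shuffle(genes)

    # Reserve a fraction of the genes (at least one) for the test side.
    n_test = max(1, int(len(genes) * test_fraction))
    test_genes = set(genes[:n_test])

    train, test, dropped = [], [], []
    for a, b in pairs:
        n_in_test = (a in test_genes) + (b in test_genes)
        if n_in_test == 2:
            test.append((a, b))
        elif n_in_test == 0:
            train.append((a, b))
        else:
            # The pair straddles the split; keeping it on either side
            # would let a test gene be seen during training.
            dropped.append((a, b))
    return train, test, dropped


# Placeholder gene IDs, not real data.
example_pairs = [("G1", "G2"), ("G2", "G3"), ("G3", "G4"),
                 ("G4", "G5"), ("G7", "G8"), ("G9", "G10")]
train_pairs, test_pairs, dropped_pairs = gene_disjoint_split(
    example_pairs, test_fraction=0.3, seed=1)
print(len(train_pairs), len(test_pairs), len(dropped_pairs))
```

The trade-off in this design is that pairs straddling the split are discarded, which shrinks the usable data but prevents a model from scoring well simply by memorizing genes it has already seen.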

Version published to 10.1101/2025.01.29.635607 on bioRxiv
Feb 1, 2025

