RepGene: Toward a Unified Gene Representation Space Robust to Missing Biological Views

Haiyang Hou
Tianyi Xia
Luni Hu
Hua Qin
Yong Zhang
Yuxiang Li
Shuangsang Fang
Lei Cao

Abstract

Genes can be described through multiple heterogeneous biological views, including genomic sequence, transcript sequence, protein sequence, textual knowledge, and single-cell expression context, yet existing gene embeddings remain largely modality-specific and are difficult to compare or reuse when many views are unavailable. We study a narrower but practically important question: whether pretrained embeddings from these distinct sources can be organized into a shared gene representation interface that remains usable under severe missing-modality conditions. To investigate this question, we introduce RepGene, a lightweight single-branch framework that combines modality adapters, a shared encoder, presence-aware fusion, and self-supervised cross-view objectives to map five biological views into one latent space. Our goal is not to claim a new multimodal learning principle or to establish superiority over all simpler fusion strategies, but to provide an initial technical instantiation for testing whether such a shared interface is feasible in a fixed-feature setting. Under a two-stage protocol in which RepGene is trained with self-supervision on frozen upstream embeddings and evaluated by downstream linear probing, we find preliminary evidence that the learned representation is broadly competitive in the full-modality setting and remains informative when only partial modality subsets are observed at inference time. The strongest signal in our study is robustness under missing views: average performance changes are often limited when one modality is removed, and even single-view inference remains non-trivial in the evaluated benchmark regime. These results do not resolve unified biological representation learning, and they should be interpreted in light of incomplete simple-fusion baselines, limited architectural ablation, benchmark dependence, and possible upstream feature exposure. We therefore position RepGene as a feasibility study and a starting point for stronger comparisons, broader benchmarks, and leakage-aware validation.
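The abstract names modality adapters, a shared encoder, and presence-aware fusion but does not specify how they are implemented. The following is a minimal illustrative sketch, not the authors' method: it assumes one linear adapter per view, a single shared `tanh` layer, and a masked mean over whichever views are present for each gene. The names `make_params` and `fuse`, the view labels, the dimensions, and the random weights are all hypothetical.

```python
import numpy as np

# Hypothetical view names for the five biological views listed in the abstract.
VIEWS = ["genomic", "transcript", "protein", "text", "expression"]


def make_params(input_dims, latent_dim, seed=0):
    """Create random (untrained) weights: one adapter per view plus one shared layer.

    input_dims: dict mapping view name -> dimension of its frozen upstream embedding.
    """
    rng = np.random.default_rng(seed)
    adapters = {
        view: rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, latent_dim))
        for view, dim in input_dims.items()
    }
    shared = rng.normal(0.0, 1.0 / np.sqrt(latent_dim), size=(latent_dim, latent_dim))
    return adapters, shared


def fuse(features, presence, adapters, shared):
    """Map per-view embeddings into one latent vector per gene.

    features: dict view -> (n_genes, d_view) array; rows for missing views may hold
              any placeholder values, since they are masked out below.
    presence: (n_genes, n_views) boolean array, columns ordered as in `adapters`.
    Returns an (n_genes, latent_dim) array; genes with no views present get zeros.
    """
    views = list(adapters)
    # Adapter (view-specific) followed by the shared encoder (same weights for all views).
    per_view = np.stack(
        [np.tanh(np.tanh(features[v] @ adapters[v]) @ shared) for v in views],
        axis=1,
    )  # shape: (n_genes, n_views, latent_dim)
    mask = presence.astype(bool)[:, :, None]
    # Assumed fusion rule: average only over the views that are actually present.
    summed = np.where(mask, per_view, 0.0).sum(axis=1)
    counts = np.maximum(mask.sum(axis=1), 1)
    return summed / counts
```

Under this assumed fusion rule, dropping a view at inference time only changes the set being averaged, so the output keeps the same shape and scale regardless of which views are available. A trained system would learn the weights with the cross-view objectives the abstract mentions, which this sketch omits.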

Version published to 10.64898/2026.06.11.731512 on bioRxiv
Jun 15, 2026

