OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability

Liang Wang


Abstract

We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing). We inject 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP labels), continue pretraining on a 32.5 GB DNA / protein / natural-language / structural mixture, and run a five-stage supervised fine-tuning pipeline (v2–v5) on 199,576 instruction-format examples across eight task families. The final stage, v5, adds a dual-head architecture: a generation head plus two per-residue classification heads (3Di, DSSP) trained jointly under a 0.5 / 0.5 loss split. v5 reaches 99.40% accuracy on BioPAWS standard protein homology, 82.60% on remote homology (500 pairs), and 93.66% on BixBench. These are gains of +14.4, +22.6, and +6.7 percentage points, respectively, over the vocabulary-extended Gemma-4-Instruct baseline, and v5 outperforms ESM-2 (650M) by +32.1 pp on the identical remote-homology split. The classification heads reach 78.6% per-residue accuracy on 3Di (chance 5%) and 100% on DSSP (chance 12.5%). MoE router activations further yield a clean 96% / 4% CPT/SFT decomposition of cross-task differentiation, providing direct interpretability of where biological specialization is acquired.
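The comparison scores behind the reported gains are not stated in the abstract, but they can be backed out from the numbers given. The sketch below does that arithmetic. It assumes each gain is a simple difference in percentage points and that "chance" means uniform guessing over a 20-letter 3Di alphabet and 8 DSSP classes. The function and variable names are illustrative, and the implied values are approximate because the gains are reported to one decimal place.

```python
# Scores (percent) and gains (percentage points) as reported in the abstract.
V5_SCORES = {
    "standard_homology": 99.40,
    "remote_homology": 82.60,
    "bixbench": 93.66,
}
GAINS_OVER_BASELINE = {
    "standard_homology": 14.4,
    "remote_homology": 22.6,
    "bixbench": 6.7,
}
ESM2_GAP_REMOTE = 32.1  # v5 minus ESM-2 (650M) on the remote-homology split


def implied_score(v5_score: float, gain_pp: float) -> float:
    """Back out a comparison score, assuming gain = v5 - comparison (in pp)."""
    return round(v5_score - gain_pp, 2)


def uniform_chance(num_classes: int) -> float:
    """Accuracy (percent) of uniform random guessing over num_classes labels."""
    return 100.0 / num_classes


# Implied (not reported) baseline scores for each benchmark.
implied_baseline = {
    task: implied_score(V5_SCORES[task], GAINS_OVER_BASELINE[task])
    for task in V5_SCORES
}
implied_esm2_remote = implied_score(V5_SCORES["remote_homology"], ESM2_GAP_REMOTE)

# Assumed class counts: 20 3Di states and 8 DSSP states.
chance_3di = uniform_chance(20)
chance_dssp = uniform_chance(8)

if __name__ == "__main__":
    for task, score in implied_baseline.items():
        print(f"{task}: implied baseline {score}%")
    print(f"remote_homology: implied ESM-2 score {implied_esm2_remote}%")
    print(f"chance levels: 3Di {chance_3di}%, DSSP {chance_dssp}%")
```

Under these assumptions the chance levels match the 5% and 12.5% quoted in the abstract, which is consistent with a 20-state and an 8-state label set.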

Version published to 10.64898/2026.05.12.724542 on bioRxiv
May 14, 2026

Related articles

sdAbs-LLM: Generative Large Language Models For de novo Antibody Design and Agentic Evaluation

This article has 4 authors:
1. Delower Hossain
2. Fuad Al Abir
3. Sixue Zhang
4. Jake Y. Chen
This article has no evaluations. Latest version Apr 21, 2026

A deterministic computational kernel encoded in the human genome

This article has 1 author:
1. Jasmine Levy
This article has no evaluations. Latest version Apr 15, 2026

Skill-Augmented Frontier Agents Nearly Saturate BixBench-Verified-50

This article has 1 author:
1. Xiaoyu Zhang
This article has no evaluations. Latest version May 1, 2026
