OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability

Abstract

We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing). We inject 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP labels), continue pretraining on a 32.5 GB mixture of DNA, protein, natural-language, and structural data, and run a five-stage supervised fine-tuning pipeline (v2–v5) on 199,576 instruction-format examples across eight task families. The final v5 adds a dual-head architecture: a generation head plus two per-residue classification heads (3Di, DSSP) trained jointly under a 0.5 / 0.5 loss split. v5 reaches 99.40% accuracy on BioPAWS standard protein homology, 82.60% on remote homology (500 pairs), and 93.66% on BixBench. These are gains of +14.4, +22.6, and +6.7 percentage points, respectively, over the vocabulary-extended Gemma-4-Instruct baseline, and v5 outperforms ESM-2 (650M) by +32.1 pp on the identical remote-homology split. The classification heads reach 78.6% per-residue accuracy on 3Di (chance 5%) and 100% on DSSP (chance 12.5%). MoE router activations further yield a clean 96% / 4% decomposition of cross-task differentiation between continued pretraining (CPT) and SFT, providing direct interpretability of where biological specialization is acquired.
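
To make the routing configuration in the abstract concrete, the following is a minimal sketch of generic top-k expert routing for a single token, using the stated figures of 128 experts per layer and 8 selected experts. It is not the model's actual router: the function name `top_k_route`, the random logits, and the choice to renormalize with a softmax over only the selected logits are illustrative assumptions, since the abstract does not describe the gating function.

```python
import math
import random

NUM_EXPERTS = 128  # experts per layer (from the abstract)
TOP_K = 8          # experts selected per token (from the abstract)


def top_k_route(logits, k=TOP_K):
    """Pick the k highest-scoring experts for one token.

    Assumption (not from the source): the mixing weights are a softmax
    taken over the k selected logits only.
    """
    # Indices of the k largest router logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Numerically stable softmax over the selected logits.
    peak = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - peak) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]


# Hypothetical router logits for a single token.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

experts, weights = top_k_route(logits)
active_fraction = TOP_K / NUM_EXPERTS  # share of experts used per token
```

Under these numbers each token activates 8 of 128 experts per layer, or 6.25% of them. Per-token expert selections of this kind are the sort of signal a router-level analysis can aggregate across tasks.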
