OmniGene-4: A Unified Bio-Language MoE Model with Router-Level Interpretability
Abstract
We introduce OmniGene-4, a unified bio-language foundation model built on Gemma-4-26B-A4B (30 layers, 128 experts per layer, top-8 routing). We inject 28,028 biological tokens (DNA and protein BPE, Foldseek 3Di, DSSP labels) into the vocabulary, continue pretraining on a 32.5 GB mixture of DNA, protein, natural-language, and structural data, and run a five-stage supervised fine-tuning pipeline (v2–v5) on 199,576 instruction-format examples across eight task families. The final stage, v5, adds a dual-head architecture: a generation head plus two per-residue classification heads (3Di, DSSP), trained jointly under a 0.5 / 0.5 loss split. v5 reaches 99.40% accuracy on BioPAWS standard protein homology, 82.60% on remote homology (500 pairs), and 93.66% on BixBench. These are gains of 14.4, 22.6, and 6.7 percentage points, respectively, over the vocabulary-extended Gemma-4-Instruct baseline, and on the identical remote-homology split v5 outperforms ESM-2 (650M) by 32.1 pp. The classification heads reach 78.6% per-residue accuracy on 3Di (chance 5%) and 100% on DSSP (chance 12.5%). MoE router activations further yield a clean 96% / 4% decomposition of cross-task differentiation between continued pretraining (CPT) and supervised fine-tuning (SFT), providing direct interpretability of where biological specialization is acquired.
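The reported gains and chance levels can be cross-checked with simple arithmetic. The sketch below is only an illustration: the baseline and ESM-2 scores it computes are implied by the reported numbers and are not stated in the abstract, and because the gains are rounded to one decimal the implied values are approximate. The class counts (20 for 3Di, 8 for DSSP) are assumptions chosen to match the stated chance levels, and the function names are hypothetical.

```python
# Reported v5 scores (%) and gains in percentage points over the
# vocabulary-extended Gemma-4-Instruct baseline, as given in the abstract.
reported = {
    "BioPAWS standard homology": (99.40, 14.4),
    "remote homology (500 pairs)": (82.60, 22.6),
    "BixBench": (93.66, 6.7),
}


def implied_baseline(score, gain_pp):
    """Score of the comparison model implied by a v5 score and its gain (pp)."""
    return round(score - gain_pp, 2)


def chance_accuracy(num_classes):
    """Accuracy (%) of uniform random guessing over num_classes labels."""
    return round(100.0 / num_classes, 2)


# Implied (not reported) baseline scores for each benchmark.
baselines = {task: implied_baseline(s, g) for task, (s, g) in reported.items()}

# Implied (not reported) ESM-2 (650M) score on the same remote-homology split.
esm2_remote = implied_baseline(82.60, 32.1)

# Assumed label-set sizes: 20 states for 3Di, 8 states for DSSP.
chance_3di = chance_accuracy(20)
chance_dssp = chance_accuracy(8)

if __name__ == "__main__":
    for task, value in baselines.items():
        print(f"{task}: implied baseline {value}%")
    print(f"ESM-2 (650M), remote homology: implied {esm2_remote}%")
    print(f"chance: 3Di {chance_3di}%, DSSP {chance_dssp}%")
```

Under these assumptions the chance levels come out to 5% and 12.5%, matching the figures quoted for the two classification heads.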