Hidden State Genomics: Graph-Based Analysis of Sparse Auto-Encoder Feature Activity in Genomic Language Models

Eliot Kmiec
Samuel O’Brien
Matthew McCoy

Abstract

Pre-trained genomic language model (gLM) representations have been expected to improve deep learning predictions across a range of genomics tasks, but recent benchmarking has raised questions about what these representations actually encode. We investigated this using mechanistic interpretability on InstaDeep’s Nucleotide Transformer v2 (500M), training sparse autoencoders (SAEs) on all 24 encoder layers to probe latent features. Correlation-based annotation against reference regulatory tracks was inconsistent across layers and insufficient for causal interpretation. We therefore built typed sequence-to-feature knowledge graphs to explore the SAE feature space, compared communities of cisplatin-binding and non-binding genomic DNA sequences by PageRank centrality, and validated candidate features with decoder-based interventions and a CNN binding classifier. The interventions had asymmetric effects: suppressive features could collapse the predictive signal, whereas binding-associated features shifted predictions cumulatively, in combination with other binding-associated signals. Dependency maps further indicated strong local feature sensitivity within sequences. Together, these results provide evidence that gLM representations encode highly granular sequence syntax and conservation patterns, and that they align more strongly with tightly coupled molecular interactions and local biophysical constraints than with complex, distributed regulatory logic. Within the scope of our intervention setting, this pattern is consistent with stronger performance on selected molecular tasks and weaker performance on broader regulatory inference, and it motivates scalable methods for causal feature annotation.
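To make the graph-based comparison in the abstract more concrete, the following is a minimal sketch of one way to rank SAE features by PageRank centrality in two sequence communities. It is not the authors' implementation. It assumes a bipartite graph whose typed nodes are sequences and features, with edges weighted by activation strength. The function names (`build_graph`, `pagerank`, `feature_centrality_gap`), the damping value, and the toy activations are all hypothetical and chosen only for illustration; the toy numbers are not results from the preprint.

```python
from collections import defaultdict


def build_graph(activations):
    """Build an undirected, weighted graph over typed nodes.

    `activations` maps sequence id -> {feature id: activation strength}.
    Nodes are tuples ("seq", id) or ("feat", id); this typing scheme is
    an assumption made for this sketch.
    """
    adj = defaultdict(dict)
    for seq_id, feats in activations.items():
        for feat_id, weight in feats.items():
            if weight <= 0:
                continue  # keep only active features
            s, f = ("seq", seq_id), ("feat", feat_id)
            adj[s][f] = weight
            adj[f][s] = weight
    return adj


def pagerank(adj, damping=0.85, iters=200, tol=1e-12):
    """Weighted PageRank by power iteration (every node has an edge)."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    out_weight = {v: sum(adj[v].values()) for v in nodes}
    for _ in range(iters):
        new = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            for u, w in adj[v].items():
                new[u] += damping * rank[v] * w / out_weight[v]
        delta = sum(abs(new[v] - rank[v]) for v in nodes)
        rank = new
        if delta < tol:
            break
    return rank


def feature_centrality_gap(binding, non_binding):
    """Per-feature PageRank in the binding graph minus the non-binding graph."""
    rank_b = pagerank(build_graph(binding))
    rank_n = pagerank(build_graph(non_binding))
    feats = {v for v in list(rank_b) + list(rank_n) if v[0] == "feat"}
    return {f[1]: rank_b.get(f, 0.0) - rank_n.get(f, 0.0) for f in feats}


# Toy, made-up activations for two communities of sequences.
binding = {"b1": {"f1": 2.0, "f2": 0.5}, "b2": {"f1": 1.5}}
non_binding = {"n1": {"f2": 1.0, "f3": 1.0}, "n2": {"f3": 2.0}}

gap = feature_centrality_gap(binding, non_binding)
ranks_binding = pagerank(build_graph(binding))
```

In this sketch, a positive gap marks a feature that is more central among binding sequences and a negative gap marks one that is more central among non-binding sequences. Such a ranking would only nominate candidates; as the abstract notes, causal claims require intervention-based validation.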

Version published to 10.64898/2026.05.13.725007 on bioRxiv
May 16, 2026

