Exploring the Limits of Probes for Latent Representation Edits in GPT Models
Abstract
Probing classifiers are a technique for understanding and modifying the operation of neural networks, in which a smaller classifier is trained on the model's internal representations to perform an auxiliary probing task. Similar to a neural electrode array, probing classifiers can be used both to read out and to edit the internal representation of a neural network. This article evaluates the use of probing classifiers to modify the internal hidden state of a chess-playing transformer. We contrast the performance of standard linear probes against Sparse Autoencoders (SAEs), a latent-space interpretability technique designed to decompose polysemantic concepts into atomic features via an overcomplete basis. Our experiments demonstrate that linear probes trained directly on the residual stream significantly outperform probes based on SAE latents. When intervention success is quantified by the probability assigned to legal moves, linear probe edits achieved an 88% success rate, whereas SAE-based edits yielded only 41%. These findings suggest that while SAEs are valuable for specific interpretability tasks, they do not improve the controllability of hidden states compared to working with the raw residual-stream vectors. Finally, we show that the residual stream respects the Markovian property of chess, validating the feasibility of applying consistent edits across different time steps for the same board state.
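As a rough illustration of the kind of intervention described above (a minimal sketch, not the authors' implementation), the following PyTorch snippet trains a linear probe on cached residual-stream activations and then applies an additive edit along the probe's class direction. The dimensions, class labels, helper names, and steering coefficient `alpha` are all assumptions for illustration.

```python
# Hypothetical sketch: a linear probe on residual-stream activations,
# and an additive edit along the probe's learned class direction.
import torch
import torch.nn as nn

D_MODEL = 512    # assumed residual-stream width of the chess transformer
N_CLASSES = 3    # e.g. square occupancy: empty / own piece / opponent piece

# 1) Train a linear probe on cached residual-stream activations.
probe = nn.Linear(D_MODEL, N_CLASSES)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def train_probe(activations: torch.Tensor, labels: torch.Tensor, epochs: int = 10):
    """activations: (N, D_MODEL) residual-stream vectors; labels: (N,) class ids."""
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(probe(activations), labels)
        loss.backward()
        optimizer.step()

# 2) Edit a hidden state by pushing it along the probe's direction for a target class.
def edit_hidden_state(h: torch.Tensor, target_class: int, alpha: float = 5.0) -> torch.Tensor:
    """Steer residual-stream vector h toward `target_class` as seen by the probe."""
    direction = probe.weight[target_class]        # (D_MODEL,) probe row for the class
    direction = direction / direction.norm()      # unit-normalize the edit direction
    return h + alpha * direction                  # additive steering edit
```

In practice such an edit would be injected at a chosen layer during the forward pass (for example via a forward hook), and success could then be measured by how much probability mass the edited model places on legal moves, in the spirit of the evaluation reported in the abstract.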