Interaction, not Dissection: Reframing AI Interpretability
Abstract
The extent to which Large Language Models (LLMs) are “interpretable”, and whether the information extracted from LLMs via interpretability methods can be seen as corresponding to these models’ “mental states”, is currently the subject of lively debate among scientists and philosophers interested in Artificial Intelligence (AI). Some hold that interpretability methods already provide evidence that LLMs genuinely represent the external world, in virtue of the fact that we can assign semantic content to their internal states. Others hold that the project of attributing semantic content to LLMs’ internal states is still very much in its infancy and could take decades; nevertheless, they seem to think, first, that such a project is worth pursuing and will eventually succeed and, second, that current interpretability methods are where our best hopes reside. In this paper, I argue that the ‘mission’ of LLM interpretability should be reframed. Instead of understanding interpretability as the project of ‘mapping’ the internal states of LLMs (and assigning representational content to them), we should understand it as the project of experimentally interacting with LLMs in order to study their behavior during these interactions. Interpretability methods, in other words, are ways to actively manipulate the LLM to elicit certain responses, not ways to ‘dissect’ and ‘catalog’ its inner structures.