Interaction, not Dissection: Reframing AI Interpretability
Abstract
The extent to which Large Language Models (LLMs) are “interpretable”, and whether the information extracted from LLMs via interpretability methods can be seen as corresponding to these models’ “mental states”, is currently the subject of lively debate among scientists and philosophers interested in Artificial Intelligence (AI). Some hold that interpretability methods already provide evidence that LLMs genuinely represent the external world, in virtue of the fact that we can assign semantic content to their internal states. Others hold that the project of attributing semantic content to LLMs’ internal states is still very much in its infancy and could take decades; nevertheless, they seem to think, first, that such a project is worth pursuing and will eventually succeed and, second, that current interpretability methods are where our best hopes reside. In this paper, I argue that the ‘mission’ of LLM interpretability should be reframed. Instead of understanding interpretability as the project of ‘mapping’ the internal states of LLMs (and assigning representational content to them), we should understand it as the project of experimentally interacting with LLMs in order to study their behavior during these interactions. Interpretability methods, in other words, are ways to actively manipulate the LLM to elicit certain responses, not ways to ‘dissect’ and ‘catalog’ its inner structures.