Mechanistic Interpretability of Transformers: Extracting Maximum Values from Lists
Abstract
The interpretability of artificial intelligence models, particularly machine learning and deep learning models, is a crucial area of research for ensuring the safe and reliable deployment of AI systems. This project explores the mechanistic interpretability of transformer models by training a small transformer on a synthetic, algorithmic task: finding the maximum value in variable-length lists. Inspired by Neel Nanda’s work on mechanistic interpretability, this study aims to reverse-engineer the trained transformer to understand its internal workings. The project involves building a transformer from scratch, training it on the maximum-extraction task, and analyzing the model’s attention patterns and decision-making processes. The results provide insights into how transformers solve algorithmic problems, highlighting the differences between the model’s approach and human reasoning. This research contributes to the broader goal of improving the transparency and interpretability of AI models, particularly in understanding their behavior on simple yet fundamental tasks.
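To make the task concrete, here is a minimal sketch of how training examples for such a max-extraction task could be generated. The vocabulary layout, value range, and special tokens (PAD, END) are illustrative assumptions for this sketch, not the article’s actual setup.

```python
import random

# Assumed vocabulary layout for this sketch: integer values 0..MAX_VAL,
# plus special tokens for padding and an end-of-list marker. The
# article's actual tokenization may differ.
MAX_VAL = 63                      # assumed value range for list elements
PAD, END = MAX_VAL + 1, MAX_VAL + 2

def make_example(max_len=10):
    """Build one (tokens, target) pair for the max-extraction task."""
    length = random.randint(2, max_len)            # variable-length list
    values = [random.randint(0, MAX_VAL) for _ in range(length)]
    # Pad every sequence to a fixed context length so batches stack cleanly.
    tokens = values + [END] + [PAD] * (max_len - length)
    return tokens, max(values)                     # model predicts the max

if __name__ == "__main__":
    toks, target = make_example()
    print(toks, "->", target)
```

Framing the task this way, with the prediction read out at a fixed END position, is one common setup that makes attention patterns easy to inspect: the interesting question becomes which list positions the heads at the END token attend to.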