Explaining Brain Computation Through Mechanistic Interpretability of Deep Neural Networks
Abstract
Deep neural networks (DNNs) are increasingly used to model brain computation, yet the principles linking their internal operations to neural mechanisms remain elusive. We propose a framework that leverages mechanistic interpretability, a recent advance in the interpretability of DNNs, to uncover shared algorithmic structure between DNNs and the brain via two parallel pathways. The Mechanism-to-Brain pathway uses interpretability to extract computational mechanisms from models and translates them into hypotheses about neural implementation. The Brain-to-Mechanism pathway begins from observed brain–model correspondences and applies interpretability to identify the model components that could instantiate similar computations. These pathways interact through mutual constraints, ensuring that advances in one domain inform and delimit inquiry in the other, and through iterative refinement, in which discrepancies between models and data drive the reciprocal revision of both mechanistic analyses and neuroscientific hypotheses. Together, they establish a principled route toward causal, mechanistic explanations of cognition, positioning DNNs as computational model organisms for probing the algorithms of the brain.