From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models

Abstract

Protein language models (pLMs) are powerful predictors of protein structure and function, learning through unsupervised training on millions of protein sequences. pLMs are thought to capture common motifs in protein sequences, but the specifics of pLM features are not well understood. Identifying these features would not only shed light on how pLMs work but could also uncover novel protein biology: studying the model to study the biology. Motivated by this, we train sparse autoencoders (SAEs) on the residual stream of a pLM, ESM-2. By characterizing SAE features, we determine that pLMs use a combination of generic features and family-specific features to represent a protein. In addition, we demonstrate how known sequence determinants of properties such as thermostability and subcellular localization can be identified by linear probing of SAE features. For predictive features without known functional associations, we hypothesize their role in unknown mechanisms and provide visualization tools to aid their interpretation. Our study gives a better understanding of the limitations of pLMs and demonstrates how SAE features can be used to help generate hypotheses for biological mechanisms. We release our code, model weights, and feature visualizer.
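
For readers who want a concrete picture of the recipe the abstract describes, here is a minimal sketch of extracting residual-stream activations from an ESM-2 layer and fitting a sparse autoencoder on them. It assumes PyTorch and the fair-esm package; the checkpoint, layer index, dictionary size, L1 coefficient, and training loop are illustrative placeholders rather than the paper's actual architecture or hyperparameters, and the authors' released code should be treated as authoritative.

```python
# Sketch only: ESM-2 activations -> simple L1-penalized sparse autoencoder.
# Checkpoint, layer, dictionary width, and l1_coef are placeholder choices.
import torch
import torch.nn as nn
import esm  # fair-esm package

# --- 1. Per-residue representations from one ESM-2 layer ---
model, alphabet = esm.pretrained.esm2_t6_8M_UR50D()  # small checkpoint for illustration
model.eval()
batch_converter = alphabet.get_batch_converter()

sequences = [("example", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
_, _, tokens = batch_converter(sequences)
layer = 6  # which layer's residual representation to use (placeholder)
with torch.no_grad():
    out = model(tokens, repr_layers=[layer])
acts = out["representations"][layer]        # (batch, seq_len, d_model); includes BOS/EOS tokens
acts = acts.reshape(-1, acts.shape[-1])     # flatten to (n_tokens, d_model)

# --- 2. A minimal sparse autoencoder over those activations ---
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, x):
        f = torch.relu(self.encoder(x))  # sparse feature activations
        x_hat = self.decoder(f)          # reconstruction of the residual activation
        return x_hat, f

sae = SparseAutoencoder(d_model=acts.shape[-1], d_hidden=8 * acts.shape[-1])
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coef = 1e-3  # trades off reconstruction fidelity against sparsity

for step in range(100):  # toy loop; real training streams activations from many sequences
    x_hat, f = sae(acts)
    loss = ((x_hat - acts) ** 2).mean() + l1_coef * f.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The learned feature activations `f` are what one would then probe (e.g., with a linear model) for properties such as thermostability or subcellular localization.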

Article activity feed

  1. We note that improved reconstruction may come at the cost of increased feature absorption (Karvonen et al., 2024)

    Judging from the close agreement in Fig. 5, the SAE reconstructions do an excellent job of recovering the residual representation at each layer. I am curious about the magnitude of the reconstruction MSE for the hyperparameters covered in Fig. 8. Have you shared any results from the SAE training itself?

    There is a tradeoff between reconstruction error and L0 sparsity, but at what point are you learning more about the SAEs themselves than about ESM-2?
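
    For context, these are the two quantities in tension. A minimal sketch of how they are typically measured, assuming an SAE and a batch of residual activations like those in the sketch after the abstract; this only illustrates the metrics, not the paper's reported values.

    ```python
    import torch

    @torch.no_grad()
    def sae_metrics(sae, acts):
        x_hat, f = sae(acts)
        mse = ((x_hat - acts) ** 2).mean().item()          # reconstruction error
        l0 = (f > 0).float().sum(dim=-1).mean().item()     # avg. active features per token
        return mse, l0
    ```

    Lower MSE means the SAE is more faithful to ESM-2's representation; lower L0 means each token is explained by fewer features. Pushing too hard on either end risks characterizing artifacts of the SAE rather than structure in ESM-2.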

  2. We developed a latent visualizer, InterProt, to streamline the process of identifying features.

    InterProt is an amazing tool for sorting through all of these findings.

    The Fig. 3C plot is also very nice for a global view of the learned latent features. What do you think about the relatively small fraction of "interesting" features (the "structural", "amino acid", "alpha helix", etc., top features on InterProt) compared to the total number of latents? Do you think this is more about our lack of knowledge of protein structure, or are the "uninteresting" latents just generally at a lower conceptual level (like point residue features) than what we find interesting (motifs with structural effects)?

  3. Intriguingly, a large subset of latents appear to be protein family-specific

    It would be very interesting to see how this effect scales with model size. I'd be curious whether the larger ESM models in particular cause more general latents to break down into family-specific groupings. This would mesh with some of the evidence of family-specific overfitting that keeps popping up in the larger models.