MotifAE Reveals Functional Motifs from Protein Language Model: Unsupervised Discovery and Interpretability Analysis

Chao Hou
Di Liu
Yufeng Shen


Abstract

Protein motifs are conserved elements that mediate processes such as folding, binding, catalysis, and post-translational modification. Although motif identification is critical for studying proteins, experimental methods are labor-intensive, only a few hundred motifs are cataloged in databases such as ELM, and existing supervised models are typically limited to predicting motifs with a specific function. Here, we present MotifAE, an unsupervised framework for discovering functional motifs from the protein language model ESM2, which captures evolutionary-scale sequence regularities. MotifAE is based on the sparse autoencoder (SAE), an encoder-decoder architecture that projects ESM2 embeddings into a sparse latent space, with an additional local similarity loss that encourages coherent latent feature activations. When benchmarked against known ELM motifs, MotifAE achieves a median AUROC of 0.88, outperforming the standard SAE (0.80). We also calculated position-specific scoring matrices (PSSMs) for MotifAE features and found that features with similar decoder weights share similar PSSMs. Furthermore, by aligning MotifAE features with experimental data through gated feature selection, we identified features associated with specific properties such as folding stability. Steering these features enabled the design of proteins with enhanced stability, as evaluated in silico. Overall, MotifAE provides a general framework for systematic motif discovery and interpretation, with the potential to advance protein function analysis, mutation effect interpretation, and rational protein engineering.
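As an illustration of the PSSM step mentioned in the abstract, the following is a minimal sketch of how a log-odds PSSM could be computed from a set of equal-length sequence windows. It is not the authors' implementation: the abstract does not say how windows around feature activations are selected, so the example windows, the uniform background distribution, the pseudocount of 1, and the function names `pssm_from_windows` and `score_window` are all assumptions made for illustration.

```python
import math
from collections import Counter

# The 20 standard amino acids; other symbols (e.g. "X") are ignored.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_from_windows(windows, pseudocount=1.0, background=None):
    """Build a log2-odds PSSM from equal-length sequence windows.

    Assumption: `windows` are fragments centered on positions where one
    latent feature is active; how they are chosen is not specified here.
    """
    if not windows:
        raise ValueError("need at least one window")
    width = len(windows[0])
    if any(len(w) != width for w in windows):
        raise ValueError("all windows must have the same length")
    if background is None:
        # Assumption: uniform background over the 20 amino acids.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

    pssm = []
    for pos in range(width):
        counts = Counter(w[pos] for w in windows)
        # Only standard residues contribute to the column total.
        observed = sum(counts.get(aa, 0) for aa in AMINO_ACIDS)
        total = observed + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            column[aa] = math.log2(freq / background[aa])
        pssm.append(column)
    return pssm


def score_window(pssm, window):
    """Sum the per-position log-odds scores of a window of matching length."""
    if len(window) != len(pssm):
        raise ValueError("window length must match PSSM width")
    return sum(col.get(aa, 0.0) for col, aa in zip(pssm, window))


# Hypothetical windows, invented for this example only.
example_windows = ["RRASL", "RRGSV", "KRASI", "RRMSL"]
example_pssm = pssm_from_windows(example_windows)
best_at_4 = max(example_pssm[3], key=example_pssm[3].get)
```

In this toy input every window has serine at the fourth position, so that column scores serine highest. Two features could then be compared by a distance between their PSSMs, which is one way to check whether features with similar decoder weights also have similar sequence preferences.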

Version published to 10.1101/2025.11.04.686576 on bioRxiv
Nov 5, 2025

Related articles

Towards functional annotation with latent protein language model features

This article has 3 authors:
1. Jake Silberg
2. Elana Simon
3. James Zou
This article has no evaluations
Latest version Oct 4, 2025

FiGS-MoD: Feature-informed Gibbs Sampling Motif Discovery Algorithm for Mapping Human Signaling Networks

This article has 3 authors:
1. Yitao Sun
2. Yu Xia
3. Jasmin Coulombe-Huntington
This article has no evaluations
Latest version Sep 29, 2025

Graph attention with structural features improves the generalizability of identifying functional sequences at a protein interface

This article has 6 authors:
1. J. Ash
2. I. M. Francino-Urdaniz
3. S. P. Kells
4. C. N. Davis
5. T. A. Whitehead
6. S. D. Khare
This article has no evaluations
Latest version Nov 10, 2025
