Rewriting protein alphabets with language models

Abstract

Detecting remote homology with both speed and sensitivity is crucial for tasks such as function annotation and structure prediction. We introduce an approach that uses contrastive learning to convert protein language model embeddings into a new 20-letter alphabet, TEA, enabling highly efficient large-scale protein homology searches. Searching with our alphabet performs on par with, and complements, structure-based methods, without requiring any structural information and at the speed of sequence search. Ultimately, we bring recent advances in protein language model representation learning to the wealth of sequence bioinformatics algorithms developed over the past half-century, offering a powerful new tool for biological discovery.
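To make the core idea concrete, here is a minimal, hypothetical sketch of the discretization step: mapping per-residue embeddings to a 20-letter string via nearest-centroid assignment against a codebook. The actual TEA alphabet is learned with contrastive training as described in the article; the random codebook, array shapes, and function names below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # 20 symbols, reusing the amino-acid letters

def discretize(embeddings: np.ndarray, codebook: np.ndarray) -> str:
    """Map per-residue embeddings (L x D) to a string over a 20-letter alphabet.

    codebook: (20 x D) array of centroid vectors, assumed learned offline
    (e.g., by contrastive training). Each residue is assigned the letter of
    its most similar centroid under cosine similarity.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    cb = codebook / np.linalg.norm(codebook, axis=1, keepdims=True)
    letters = (emb @ cb.T).argmax(axis=1)  # nearest centroid per residue
    return "".join(ALPHABET[i] for i in letters)

# Illustrative usage with stand-in data (not real PLM embeddings):
rng = np.random.default_rng(0)
protein = rng.normal(size=(120, 1280))   # per-residue embeddings for one protein
codebook = rng.normal(size=(20, 1280))   # stand-in for a learned codebook
print(discretize(protein, codebook))
```

Because the output is an ordinary 20-letter string, it can be written to FASTA and searched with off-the-shelf sequence tools, which is what gives the method the speed of sequence search.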
