ESMFold Hallucinates Native-Like Protein Sequences

Jeliazko R. Jeliazkov
Diego del Alamo
Joel D. Karpiak

This article has been Reviewed by the following groups

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

Evaluated articles (Arcadia Science)

Abstract

We describe attempts to design protein sequences by inverting the protein structure prediction algorithm ESMFold. State-of-the-art protein structure prediction methods achieve high accuracy by relying on evolutionary patterns derived from either multiple sequence alignments (AlphaFold, RosettaFold) or pretrained protein language models (PLMs; ESMFold, OmegaFold). In principle, by inverting these networks, protein sequences can be designed to fulfill one or more design objectives, such as high prediction confidence, predicted protein binding, or other geometric constraints that can be expressed with loss functions. In practice, sequences designed using an inverted AlphaFold model, termed AFDesign, contain unnatural sequence profiles shown to express poorly, whereas an inverted RosettaFold network has been shown to be sensitive to adversarial sequences. Here, we demonstrate that these limitations do not extend to neural networks that include PLMs, such as ESMFold. Using an inverted ESMFold model, termed ESM-Design, we generated sequences with profiles that are both more native-like and more likely to express than sequences generated using AFDesign, but less likely to express than sequences rescued by the structure-based design method ProteinMPNN. However, the safeguard offered by the PLM came with steep increases in memory consumption, preventing proteins greater than 150 residues from being modeled on a single GPU with 80GB VRAM. During this investigation, we also observed the role played by different sequence initialization schemes, with random sampling of discrete amino acids improving convergence and model quality over any continuous random initialization method. Finally, we showed how this approach can be used to introduce sequence and structure diversification in small proteins such as ubiquitin, while respecting the sequence conservation of active site residues. Our results highlight the effects of architectural differences between structure prediction networks on zero-shot protein design.

Arcadia Science
Jun 2, 2023

but less likely to express than sequences rescued by the structure-based design method ProteinMPNN. However, the safeguard offered by the PLM came with steep increases in memory consumption, preventing proteins greater than 150 residues from being modeled on a single GPU with 80GB VRAM.

thank you for being forthcoming directly in the abstract. this investigation is interesting regardless of its relative performance compared to alternative methods.

Read the original source
Version published to 10.1101/2023.05.23.541774 on bioRxiv
May 24, 2023

A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluationsLatest version Dec 24, 2025
The Evolution of the AlphaFold Architecture

This article has 1 author:
1. Y.C.B.J. Dissanayaka
This article has no evaluationsLatest version Jan 9, 2026
Quantum-Assisted Refinement of AlphaFold Protein Structures

This article has 1 author:
1. Parham Ghayour
This article has no evaluationsLatest version Dec 31, 2025

This article has been Reviewed by the following groups

Discuss this preprint

Listed in

Abstract

Article activity feed

Related articles

A Survey on Efficient Protein Language Models

The Evolution of the AlphaFold Architecture

Quantum-Assisted Refinement of AlphaFold Protein Structures