Evolutionary-scale prediction of atomic-level protein structure with a language model

Zeming Lin
Halil Akin
Roshan Rao
Brian Hie
Zhongkai Zhu
Wenting Lu
Nikita Smetanin
Robert Verkuil
Ori Kabeli
Yaniv Shmueli
Allan dos Santos Costa
Maryam Fazel-Zarandi
Tom Sercu
Salvatore Candido
Alexander Rives

This article has been reviewed by the following groups

Listed in

Evaluated articles (Arcadia Science)

Abstract

Recent advances in machine learning have leveraged evolutionary information in multiple sequence alignments to predict protein structure. We demonstrate direct inference of full atomic-level protein structure from primary sequence using a large language model. As language models of protein sequences are scaled up to 15 billion parameters, an atomic-resolution picture of protein structure emerges in the learned representations. This results in an order-of-magnitude acceleration of high-resolution structure prediction, which enables large-scale structural characterization of metagenomic proteins. We apply this capability to construct the ESM Metagenomic Atlas by predicting structures for >617 million metagenomic protein sequences, including >225 million that are predicted with high confidence, which gives a view into the vast breadth and diversity of natural proteins.

Version published to 10.1126/science.ade2574
Mar 17, 2023
Arcadia Science
Nov 4, 2022

Fig. S7 shows results at different MSA depth thresholds. After filtering, there are 104 sequences with MSA depth ≤ 100, 70 sequences with MSA depth ≤ 10, and 22 sequences with MSA depth = 1. Beyond the constraint that no template has TM-score > 0.5, no filtering on the number of templates is performed.
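The filtering described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' pipeline: the `Target` record, its field names, and the toy values are hypothetical, and the only details taken from the text are the template cutoff (no template with TM-score > 0.5) and the three MSA depth thresholds (≤ 100, ≤ 10, = 1). Note that the depth bins are nested, which matches the decreasing counts reported (104, 70, 22).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Hypothetical record for one evaluation sequence (illustrative fields)."""
    name: str
    msa_depth: int          # number of sequences in the target's MSA
    max_template_tm: float  # highest TM-score over all templates (0.0 if none)


def passes_template_filter(target: Target, tm_cutoff: float = 0.5) -> bool:
    # Keep the target only if no template exceeds the TM-score cutoff.
    # The number of templates is deliberately not checked.
    return target.max_template_tm <= tm_cutoff


def bin_by_msa_depth(targets: list[Target]) -> dict[str, list[Target]]:
    # Apply the template filter first, then build nested depth subsets.
    kept = [t for t in targets if passes_template_filter(t)]
    return {
        "depth<=100": [t for t in kept if t.msa_depth <= 100],
        "depth<=10": [t for t in kept if t.msa_depth <= 10],
        "depth==1": [t for t in kept if t.msa_depth == 1],
    }


# Toy data, invented purely for illustration.
toy_targets = [
    Target("a", msa_depth=1, max_template_tm=0.30),
    Target("b", msa_depth=8, max_template_tm=0.45),
    Target("c", msa_depth=50, max_template_tm=0.20),
    Target("d", msa_depth=5, max_template_tm=0.70),   # removed: template TM-score > 0.5
    Target("e", msa_depth=500, max_template_tm=0.10),  # kept, but too deep for every bin
]

bins = bin_by_msa_depth(toy_targets)
counts = {label: len(members) for label, members in bins.items()}
print(counts)
```

Because each bin is a subset of the previous one, a sequence with MSA depth 1 is counted in all three thresholds.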

It would be interesting to know whether the proteins whose structures you still cannot predict have anything in common. For example, are they more likely to come from certain environments or environmental conditions (e.g., low-temperature, high-temperature, or high-salt samples)? Also, is it possible to take environmental conditions into account in the structure prediction itself? For example, if samples came from a hydrothermal vent at 90 °C, would that information be useful in any of the predictions?

Version published to 10.1101/2022.07.20.500902 on bioRxiv
Jul 21, 2022

Related articles

Emergence of Biological Structural Discovery in General-Purpose Language Models

This article has 1 author:
1. Liang Wang
This article has no evaluations. Latest version: Jan 8, 2026

The Evolution of the AlphaFold Architecture

This article has 1 author:
1. Y.C.B.J. Dissanayaka
This article has no evaluations. Latest version: Jan 9, 2026

A Survey on Efficient Protein Language Models

This article has 8 authors:
1. Shouren Wang
2. Debargha Ganguly
3. Vinooth Kulkarni
4. Wang Yang
5. Zhuoran Qiao
6. Daniel Blankenberg
7. Vipin Chaudhary
8. Xiaotian Han
This article has no evaluations. Latest version: Dec 24, 2025
